MatrixVector

ws.qe
Class MatrixVector

java.lang.Object
  ws.qe.MatrixVector

public class MatrixVector
extends java.lang.Object

This class contains methods for building matrices and computing the corresponding vectors, based on the "Local Association Clustering" and "Local Metric Clustering" algorithms.

The implementation of this class relies on Jakarta Lucene. For more information, see the Jakarta Lucene javadoc.

Author:: Sinian Zhang
See Also:: Tools.queryExpansionResult(String, ArrayList, ArrayList, float[], String)

Field Summary
`private int`	`numDocs` The number of documents.
`private int`	`numStems` The number of stems found in all the documents.
`private static java.lang.String`	`stemIndexPath` The index directory for the documents of stems.
`private java.util.ArrayList`	`stemSet` The set of the stems found in all the documents.
`private static java.lang.String`	`tokenIndexPath` The index directory for the documents of tokens.
`private java.util.ArrayList`	`tokenSet` The set of the tokens found in all the documents.

Constructor Summary
`MatrixVector(java.lang.String stemIndexPath, java.lang.String tokenIndexPath, java.util.ArrayList tokenSet, java.util.ArrayList stemSet)` Initializes a MatrixVector object.

Method Summary
`private int`	`Correlation_AC(java.util.Hashtable docStemMatrix, int u, int v)` This method calculates C(u, v), the so-called "unnormalized association correlation factor" based on the algorithm "Association Clustering".
`private static float`	`Correlation_MC(java.util.ArrayList tokenListu, int sizeu, java.util.ArrayList tokenListv, int sizev)` This method calculates C(u, v), the so-called "unnormalized metric correlation factor" based on the algorithm "Metric Clustering".
`private static float`	`distanceTokens_MC(int[] posListu, int[] posListv)` This method calculates, in a document, the sum of the distances between all the tokens that belong to two words.
`private static float`	`distanceWords_MC(java.util.Hashtable docPosu, java.util.Hashtable docPosv)` This method calculates the distance between two words in a document.
`protected java.util.Hashtable`	`getDocStemMatrix_AC()` Creates the so-called "document-stem-matrix" based on the algorithm "Association Clustering".
`protected float[]`	`getStemStemVector_AC(java.util.Hashtable docStemMatrix, java.lang.String query)` This method calculates one "stem-stem-vector" for a given stemmed query in the normalized "stem-stem-matrix" based on the algorithm "Association Clustering".
`static float[]`	`getTopStemStemVector_MC(java.lang.String query, java.util.ArrayList tokenSet, java.util.ArrayList stemSet, int[] topStemsPosition)` This method calculates only part of the "stem-stem-vector" for a given stemmed query in the normalized "stem-stem-matrix" based on the algorithm "Metric Clustering".

Methods inherited from class java.lang.Object

clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Detail

stemIndexPath

private static java.lang.String stemIndexPath

The index directory for the documents of stems.

See Also:: Tools.indexDocs(String, String), Tools.doIndexing(IndexWriter, File)

tokenIndexPath

private static java.lang.String tokenIndexPath

The index directory for the documents of tokens.

See Also:: Tools.indexDocs(String, String), Tools.doIndexing(IndexWriter, File)

stemSet

private java.util.ArrayList stemSet

The set of the stems found in all the documents.

See Also:: Tools.getSet(String)

tokenSet

private java.util.ArrayList tokenSet

The set of the tokens found in all the documents.

See Also:: Tools.getSet(String)

numDocs

private int numDocs

The number of documents. (Module I: equal to or smaller than the number delivered by the "Web"-box; Module II: the same number as delivered by the "Web"-box.)

See Also:: Google.getURLsList(JDialog, String, int), Tools.WebTextTokenStem(String[], String, String, String, String), Tools.UrlWebTextTokenStem(String, int)

numStems

private int numStems

The number of stems found in all the documents.

See Also:: Tools.getSet(String)

Constructor Detail

MatrixVector

public MatrixVector(java.lang.String stemIndexPath,
                    java.lang.String tokenIndexPath,
                    java.util.ArrayList tokenSet,
                    java.util.ArrayList stemSet)
             throws java.io.IOException

Initializes a MatrixVector object.
Parameters:: stemIndexPath - The index directory for the documents of stems.; tokenIndexPath - The index directory for the documents of tokens.; tokenSet - The set of the tokens found in all the documents.; stemSet - The set of the stems found in all the documents.
Throws:: java.io.IOException
See Also:: Tools.indexDocs(String, String), Tools.getSet(String)

Method Detail

getDocStemMatrix_AC

protected java.util.Hashtable getDocStemMatrix_AC()
                                           throws java.io.IOException

Creates the so-called "document-stem-matrix" based on the algorithm "Association Clustering".

Each element of this matrix holds the frequency of a stem in a document. Since most of the elements are zero, the matrix is stored in a hashtable: if the y-th (indexed) stem occurs f times in the x-th (indexed) document, then "x:y" (String) is stored as the key and f (int) as the value.
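The sparse storage scheme described above can be sketched as follows. This is only an illustration, not the MatrixVector source: the class name `SparseDocStemMatrix` and its methods are hypothetical, and a generic `HashMap` stands in for the raw `Hashtable` used by the class.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of the "x:y" -> f storage scheme; not the original code.
final class SparseDocStemMatrix {
    private final Map<String, Integer> cells = new HashMap<>();

    // Key format "x:y": x is the document index, y is the stem index.
    static String key(int doc, int stem) {
        return doc + ":" + stem;
    }

    // Record one more occurrence of the stem in the document.
    void increment(int doc, int stem) {
        cells.merge(key(doc, stem), 1, Integer::sum);
    }

    // Keys that were never stored represent the zero entries of the matrix.
    int frequency(int doc, int stem) {
        return cells.getOrDefault(key(doc, stem), 0);
    }

    // Number of non-zero entries actually held in memory.
    int storedCells() {
        return cells.size();
    }
}
```

Only the non-zero cells occupy memory, so a lookup for a stem that never occurs in a document simply falls back to 0.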

Throws:: java.io.IOException
See Also:: org.apache.lucene.index.TermDocs

getStemStemVector_AC

protected float[] getStemStemVector_AC(java.util.Hashtable docStemMatrix,
                                       java.lang.String query)
                                throws java.io.IOException

This method calculates one "stem-stem-vector" for a given stemmed query in the normalized "stem-stem-matrix" based on the algorithm "Association Clustering".

In the "stem-stem-matrix", each element S(u, v) is the so-called "normalized association correlation factor" for two stems. The formula is: S(u, v) = C(u, v) / (C(u, u) + C(v, v) - C(u, v)), where C(u, v) is the so-called "unnormalized association correlation factor" for two stems, calculated from the co-occurrence of the two stems in the documents ("document-stem-matrix").

The whole vector is calculated and all its values are ranked. For the stems with the top values, the "metric correlation factor" is then calculated. Together, these two factors determine the top stems, which become the expanded stems for the given query.
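A minimal sketch of the normalization and ranking steps described above. The class `AssociationRanking` and both method names are hypothetical. Returning 0 when the denominator is 0 is an assumption, since the documentation does not say how that case is handled.

```java
import java.util.Arrays;

// Hypothetical helper; illustrates S(u, v) and the ranking step only.
final class AssociationRanking {

    // S(u, v) = C(u, v) / (C(u, u) + C(v, v) - C(u, v)).
    // Assumption: a zero denominator yields 0 instead of a division error.
    static float normalized(int cuv, int cuu, int cvv) {
        int denominator = cuu + cvv - cuv;
        return denominator == 0 ? 0f : (float) cuv / denominator;
    }

    // Positions of the k largest values in the vector, largest first.
    static int[] topPositions(float[] vector, int k) {
        Integer[] order = new Integer[vector.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        // Sort the positions by descending vector value.
        Arrays.sort(order, (a, b) -> Float.compare(vector[b], vector[a]));
        int n = Math.min(k, order.length);
        int[] top = new int[n];
        for (int i = 0; i < n; i++) {
            top[i] = order[i];
        }
        return top;
    }
}
```

The positions returned by `topPositions` play the role of the `topStemsPosition` argument that `getTopStemStemVector_MC` expects.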

Parameters:: docStemMatrix - The so-called "document-stem-matrix".; query - The query, an English word, the same word delivered by "Word"-inputfield.
Returns:: A vector, each element quantifies the (normalized) frequency of co-occurrence for two stems.
Throws:: java.io.IOException
See Also:: Correlation_AC(Hashtable, int, int), getTopStemStemVector_MC(String, ArrayList, ArrayList, int[]), Tools.queryExpansionResult(String, ArrayList, ArrayList, float[], String)

Correlation_AC

private int Correlation_AC(java.util.Hashtable docStemMatrix,
                           int u,
                           int v)
                    throws java.io.IOException

This method calculates C(u, v), the so-called "unnormalized association correlation factor" based on the algorithm "Association Clustering".

Given the u-th stem and the v-th stem in the "stem-stem-matrix", the formula is: C(u, v) = f(1, u) x f(1, v) + f(2, u) x f(2, v) + ... + f(n, u) x f(n, v), where f(i, j) is the frequency of the j-th stem in the i-th document and n is the total number of documents.
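The sum above can be sketched over an "x:y"-keyed map as shown below. The class `AssociationCorrelation` is hypothetical, indices are assumed to start at 0, and `numDocs` is passed as a parameter here, whereas the class itself keeps it in a field.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of C(u, v) = sum over documents of f(i, u) x f(i, v).
final class AssociationCorrelation {

    static int unnormalized(Map<String, Integer> docStemMatrix, int numDocs, int u, int v) {
        int sum = 0;
        for (int doc = 0; doc < numDocs; doc++) {
            // Missing keys are the zero entries of the sparse matrix.
            int fu = docStemMatrix.getOrDefault(doc + ":" + u, 0);
            int fv = docStemMatrix.getOrDefault(doc + ":" + v, 0);
            sum += fu * fv;
        }
        return sum;
    }

    // Small sample matrix used for illustration only.
    static Map<String, Integer> sampleMatrix() {
        Map<String, Integer> m = new HashMap<>();
        m.put("0:0", 2);
        m.put("0:1", 3);
        m.put("1:0", 1);
        m.put("2:0", 1);
        m.put("2:1", 4);
        return m;
    }
}
```

With the sample matrix and three documents, C(0, 1) = 2 x 3 + 1 x 0 + 1 x 4 = 10, because the second document contains only stem 0 and contributes nothing.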

Parameters:: docStemMatrix - The so-called "document-stem-matrix".; u - The u-th row in "stem-stem-matrix".; v - The v-th column in "stem-stem-matrix".
Returns:: The value of the "unnormalized correlation factor".
Throws:: java.io.IOException
See Also:: getStemStemVector_AC(Hashtable, String)

getTopStemStemVector_MC

public static float[] getTopStemStemVector_MC(java.lang.String query,
                                              java.util.ArrayList tokenSet,
                                              java.util.ArrayList stemSet,
                                              int[] topStemsPosition)
                                       throws java.io.IOException

This method calculates only part of the "stem-stem-vector" for a given stemmed query in the normalized "stem-stem-matrix" based on the algorithm "Metric Clustering".

In the "stem-stem-matrix", each element S(u, v) is the so-called "normalized metric correlation factor" for two stems. The formula is: S(u, v) = C(u, v) / (|V(Su)| x |V(Sv)|), where C(u, v) is the so-called "unnormalized metric correlation factor" for two stems, calculated from the distance between the two stems in a document, and |V| is the size of the word set belonging to one stem.

Not all the values in this vector are calculated, only those for the stems with the top values in the "stem-stem-vector" based on the algorithm "Association Clustering". Together, these two factors determine the top stems, which become the expanded stems for the given query.
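The normalization above, applied only to the preselected stems, might look like the following sketch. `MetricNormalization` and its methods are hypothetical. The unnormalized factors and word-set sizes are assumed to be already available as arrays, which is not how the real method receives its input.

```java
// Hypothetical sketch of S(u, v) = C(u, v) / (|V(Su)| x |V(Sv)|).
final class MetricNormalization {

    static float normalized(float cuv, int sizeU, int sizeV) {
        if (sizeU <= 0 || sizeV <= 0) {
            // Assumption: a stem always has at least one word in its word set.
            throw new IllegalArgumentException("word sets must be non-empty");
        }
        return cuv / ((float) sizeU * sizeV);
    }

    // Computes the partial vector for the top stems only.
    // cForTop[i] is C(query stem, i-th top stem); topSetSizes[i] is that stem's word-set size.
    static float[] partialVector(float[] cForTop, int querySetSize, int[] topSetSizes) {
        float[] result = new float[cForTop.length];
        for (int i = 0; i < cForTop.length; i++) {
            result[i] = normalized(cForTop[i], querySetSize, topSetSizes[i]);
        }
        return result;
    }
}
```

Restricting the computation to the top stems keeps the costlier position-based metric factor from being evaluated for every stem in the stem set.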

Parameters:: query - The query, an English word, the same word delivered by "Word"-inputfield.; tokenSet - The set of the tokens found in all the documents.; stemSet - The set of the stems found in all the documents.; topStemsPosition - The positions of the stems in stem-set, these stems have top values in "stem-stem-vector" based on the algorithm "Association Clustering".
Returns:: A list of values in "stem-stem-vector" based on the algorithm "Metric Clustering" for the stems, which have top values in "stem-stem-vector" based on the algorithm "Association Clustering".
Throws:: java.io.IOException
See Also:: Correlation_MC(ArrayList, int, ArrayList, int), distanceWords_MC(Hashtable, Hashtable), distanceTokens_MC(int[], int[]), getStemStemVector_AC(Hashtable, String), Tools.queryExpansionResult(String, ArrayList, ArrayList, float[], String)

Correlation_MC

private static float Correlation_MC(java.util.ArrayList tokenListu,
                                    int sizeu,
                                    java.util.ArrayList tokenListv,
                                    int sizev)
                             throws java.io.IOException

This method calculates C(u, v), the so-called "unnormalized metric correlation factor" based on the algorithm "Metric Clustering".

Given the u-th stem (with a word set of size m, whose words are uw1, uw2, ...) and the v-th stem (with a word set of size n, whose words are vw1, vw2, ...) in the "stem-stem-matrix", the formula is: C(u, v) = 1/d(uw1, vw1) + 1/d(uw1, vw2) + 1/d(uw1, vw3) + ... + 1/d(uw2, vw1) + 1/d(uw2, vw2) + 1/d(uw2, vw3) + ... + 1/d(uwi, vwj), where i varies from 1 to m and j varies from 1 to n. d(uwi, vwj) is defined as the distance between two words in the same document; if one or both of them are absent from a document, then 1/d(uwi, vwj) = 0.

The distance between two words is defined as the average distance of all their tokens in a document. For instance, if a word occurs three times in a document, it has three tokens at three different positions in that document. These words and their positions in a document are stored in a hashtable, with the document index as key and the position list as value.
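The summation of inverse distances can be sketched for a precomputed table of word distances. `MetricCorrelation` is a hypothetical class. Two points are assumptions of this sketch: a non-positive table entry marks a word pair that does not co-occur, and the table covers a single document, since the text does not spell out how several documents are combined.

```java
// Hypothetical sketch of C(u, v) = sum over all word pairs of 1 / d(uwi, vwj).
final class MetricCorrelation {

    // distances[i][j] is d(uwi, vwj); a value <= 0 marks a pair that does not co-occur.
    static float unnormalized(float[][] distances) {
        float sum = 0f;
        for (float[] row : distances) {
            for (float d : row) {
                if (d > 0f) {
                    sum += 1f / d;
                }
                // Otherwise the pair contributes 0, as the definition requires.
            }
        }
        return sum;
    }
}
```

For the table {{2, 4}, {0, 1}} the result is 1/2 + 1/4 + 0 + 1/1 = 1.75, so closer word pairs contribute more to the factor.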

Parameters:: tokenListu - A list of words for the u-th stem in "stem-stem-matrix".; sizeu - The size of the list above.; tokenListv - A list of words for the v-th stem in "stem-stem-matrix".; sizev - The size of the list above.
Returns:: The value of the "unnormalized correlation factor".
Throws:: java.io.IOException
See Also:: getTopStemStemVector_MC(String, ArrayList, ArrayList, int[]), distanceTokens_MC(int[], int[]), distanceWords_MC(Hashtable, Hashtable), org.apache.lucene.index.TermPositions

distanceWords_MC

private static float distanceWords_MC(java.util.Hashtable docPosu,
                                      java.util.Hashtable docPosv)

This method calculates the distance between two words in a document.

The distance between two words is defined as the average distance of all their tokens in a document. For instance, if a word occurs three times in a document, it has three tokens at three different positions in that document. For a stem, its words and their positions in a document are stored in a hashtable, with the document index as key and the position list as value.
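One possible reading of this definition is sketched below. `TokenDistance` is a hypothetical class, and the sketch rests on two assumptions: that the distance between two tokens is the absolute difference of their positions, and that the average is taken over every token pair. The documentation confirms neither.

```java
// Hypothetical sketch of the token-level sum and the word-level average distance.
final class TokenDistance {

    // Sum of |pu - pv| over every pair of token positions of the two words.
    static float sumOfDistances(int[] posListU, int[] posListV) {
        float sum = 0f;
        for (int pu : posListU) {
            for (int pv : posListV) {
                sum += Math.abs(pu - pv);
            }
        }
        return sum;
    }

    // Average distance over all token pairs.
    // Assumption: 0 is returned when either word has no token in the document.
    static float averageDistance(int[] posListU, int[] posListV) {
        int pairs = posListU.length * posListV.length;
        return pairs == 0 ? 0f : sumOfDistances(posListU, posListV) / pairs;
    }
}
```

For a word at positions 1, 5 and 9 and another word at position 3, the sum is 2 + 2 + 6 = 10 over three pairs, which gives an average distance of 10/3.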

Parameters:: docPosu - The hashtable storing all words and their positions for the u-th stem in "stem-stem-matrix".; docPosv - The hashtable storing all words and their positions for the v-th stem in "stem-stem-matrix".
Returns:: The distance between two words in a document.
See Also:: getTopStemStemVector_MC(String, ArrayList, ArrayList, int[]), distanceTokens_MC(int[], int[]), distanceWords_MC(Hashtable, Hashtable)

distanceTokens_MC

private static float distanceTokens_MC(int[] posListu,
                                       int[] posListv)

This method calculates, in a document, the sum of the distances between all the tokens that belong to two words.

Parameters:: posListu - The positions of all the tokens belonging to one word in a document, this word is one of the words sharing the same u-th stem in "stem-stem-matrix".; posListv - The positions of all the tokens belonging to one word in a document, this word is one of the words sharing the same v-th stem in "stem-stem-matrix".
Returns:: The sum of the distances of all the tokens belonging to two words.
See Also:: getTopStemStemVector_MC(String, ArrayList, ArrayList, int[]), distanceTokens_MC(int[], int[]), distanceWords_MC(Hashtable, Hashtable)
