ws.qe
Class MatrixVector

java.lang.Object
  extended by ws.qe.MatrixVector

public class MatrixVector
extends java.lang.Object

This class contains the methods for creating matrices and calculating vectors from them, based on the "Local Association Clustering" and "Local Metric Clustering" algorithms.

The implementation of this class relies on Jakarta Lucene; for more information, see the Jakarta Lucene javadoc.

Author:
Sinian Zhang
See Also:
Tools.queryExpansionResult(String, ArrayList, ArrayList, float[], String)

Field Summary
private  int numDocs
          The number of documents.
private  int numStems
          The number of stems found in all the documents.
private static java.lang.String stemIndexPath
          The index directory for the documents of stems.
private  java.util.ArrayList stemSet
          The set of the stems found in all the documents.
private static java.lang.String tokenIndexPath
          The index directory for the documents of tokens.
private  java.util.ArrayList tokenSet
          The set of the tokens found in all the documents.
 
Constructor Summary
MatrixVector(java.lang.String stemIndexPath, java.lang.String tokenIndexPath, java.util.ArrayList tokenSet, java.util.ArrayList stemSet)
          Initializes a MatrixVector object.
 
Method Summary
private  int Correlation_AC(java.util.Hashtable docStemMatrix, int u, int v)
          This method calculates C(u, v), the so-called "unnormalized association correlation factor" based on the algorithm "Association Clustering".
private static float Correlation_MC(java.util.ArrayList tokenListu, int sizeu, java.util.ArrayList tokenListv, int sizev)
          This method calculates C(u, v), the so-called "unnormalized metric correlation factor" based on the algorithm "Metric Clustering".
private static float distanceTokens_MC(int[] posListu, int[] posListv)
          This method calculates, for one document, the sum of the distances between all pairs of tokens belonging to two words.
private static float distanceWords_MC(java.util.Hashtable docPosu, java.util.Hashtable docPosv)
          This method calculates the distance between two words in a document.
protected  java.util.Hashtable getDocStemMatrix_AC()
          Creates the so-called "document-stem-matrix" based on the algorithm "Association Clustering".
protected  float[] getStemStemVector_AC(java.util.Hashtable docStemMatrix, java.lang.String query)
          This method calculates one "stem-stem-vector" for a given stemmed query in the normalized "stem-stem-matrix" based on the algorithm "Association Clustering".
static float[] getTopStemStemVector_MC(java.lang.String query, java.util.ArrayList tokenSet, java.util.ArrayList stemSet, int[] topStemsPosition)
          This method calculates only part of the "stem-stem-vector" for a given stemmed query in the normalized "stem-stem-matrix" based on the algorithm "Metric Clustering".
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

stemIndexPath

private static java.lang.String stemIndexPath
The index directory for the documents of stems.

See Also:
Tools.indexDocs(String, String), Tools.doIndexing(IndexWriter, File)

tokenIndexPath

private static java.lang.String tokenIndexPath
The index directory for the documents of tokens.

See Also:
Tools.indexDocs(String, String), Tools.doIndexing(IndexWriter, File)

stemSet

private java.util.ArrayList stemSet
The set of the stems found in all the documents.

See Also:
Tools.getSet(String)

tokenSet

private java.util.ArrayList tokenSet
The set of the tokens found in all the documents.

See Also:
Tools.getSet(String)

numDocs

private int numDocs
The number of documents. (Module I: equal to or smaller than the number delivered by the "Web"-box; Module II: the same number as delivered by the "Web"-box.)

See Also:
Google.getURLsList(JDialog, String, int), Tools.WebTextTokenStem(String[], String, String, String, String), Tools.UrlWebTextTokenStem(String, int)

numStems

private int numStems
The number of stems found in all the documents.

See Also:
Tools.getSet(String)
Constructor Detail

MatrixVector

public MatrixVector(java.lang.String stemIndexPath,
                    java.lang.String tokenIndexPath,
                    java.util.ArrayList tokenSet,
                    java.util.ArrayList stemSet)
             throws java.io.IOException
Initializes a MatrixVector object.

Parameters:
stemIndexPath - The index directory for the documents of stems.
tokenIndexPath - The index directory for the documents of tokens.
tokenSet - The set of the tokens found in all the documents.
stemSet - The set of the stems found in all the documents.
Throws:
java.io.IOException
See Also:
Tools.indexDocs(String, String), Tools.getSet(String)
Method Detail

getDocStemMatrix_AC

protected java.util.Hashtable getDocStemMatrix_AC()
                                           throws java.io.IOException
Creates the so-called "document-stem-matrix" based on the algorithm "Association Clustering".

Each element of this matrix is the frequency of a stem in a document. Since the majority of the elements are zero, the matrix is stored in a hashtable: if the y-th (indexed) stem occurs f times in the x-th (indexed) document, then "x:y" (String) is stored as the key and f (int) as the value.
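
The storage scheme described above can be sketched as follows. This is a minimal illustration only, not the class's actual code: the class name SparseDocStemMatrixSketch and its methods are hypothetical, and the use of generics and 0-based indices are assumptions.

```java
import java.util.Hashtable;

/** Hypothetical sketch of the sparse "x:y" -> frequency storage (not the original code). */
final class SparseDocStemMatrixSketch {
    // Key: "docIndex:stemIndex", value: frequency of that stem in that document.
    private final Hashtable<String, Integer> cells = new Hashtable<>();

    /** Stores the frequency of a stem in a document; zero entries are skipped. */
    void put(int docIdx, int stemIdx, int freq) {
        if (freq != 0) {
            cells.put(docIdx + ":" + stemIdx, freq);
        }
    }

    /** Returns the stored frequency, or 0 for a cell that was never stored. */
    int get(int docIdx, int stemIdx) {
        Integer f = cells.get(docIdx + ":" + stemIdx);
        return f == null ? 0 : f;
    }

    /** Number of non-zero cells actually held in the hashtable. */
    int storedCells() {
        return cells.size();
    }
}
```

Only non-zero cells occupy memory, so a lookup of a missing key is read as frequency 0.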

Throws:
java.io.IOException
See Also:
org.apache.lucene.index.TermDocs

getStemStemVector_AC

protected float[] getStemStemVector_AC(java.util.Hashtable docStemMatrix,
                                       java.lang.String query)
                                throws java.io.IOException
This method calculates one "stem-stem-vector" for a given stemmed query in the normalized "stem-stem-matrix" based on the algorithm "Association Clustering".

In the "stem-stem-matrix", each element S(u, v) is the so-called "normalized association correlation factor" for two stems. The formula is: S(u, v) = C(u, v) / (C(u, u) + C(v, v) - C(u, v)), where C(u, v) is the so-called "unnormalized association correlation factor" for two stems. C(u, v) is calculated from the co-occurrence of the two stems in the documents (the "document-stem-matrix").

The whole vector is calculated and all its values are ranked. For the stems with the top values, the "metric correlation factor" is then calculated. Together, these two factors determine the top stems, which become the expanded stems for the given query.
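
As an illustration of the normalization and the ranking step, a minimal sketch follows. The class NormalizedAssociationSketch and both method names are hypothetical. Returning 0 for a zero denominator and breaking ties by the lower index are assumptions, not documented behavior.

```java
import java.util.Arrays;

/** Hypothetical sketch of S(u, v) and of ranking a stem-stem-vector (not the original code). */
final class NormalizedAssociationSketch {

    /** S(u, v) = C(u, v) / (C(u, u) + C(v, v) - C(u, v)); 0 if the denominator is 0 (assumption). */
    static float normalize(int cuv, int cuu, int cvv) {
        int denominator = cuu + cvv - cuv;
        return denominator == 0 ? 0f : (float) cuv / denominator;
    }

    /** Returns the positions of the k largest values, highest value first. */
    static int[] topPositions(float[] vector, int k) {
        Integer[] order = new Integer[vector.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        // Sort positions by descending value; the sort is stable, so ties keep index order.
        Arrays.sort(order, (a, b) -> Float.compare(vector[b], vector[a]));
        int n = Math.max(0, Math.min(k, order.length));
        int[] top = new int[n];
        for (int i = 0; i < n; i++) {
            top[i] = order[i];
        }
        return top;
    }
}
```

For example, with C(u, v) = 6, C(u, u) = 5 and C(v, v) = 9 the factor is 6 / (5 + 9 - 6) = 0.75, and a stem compared with itself always yields 1.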

Parameters:
docStemMatrix - The so-called "document-stem-matrix".
query - The query, an English word; the same word delivered by the "Word" input field.
Returns:
A vector, each element quantifies the (normalized) frequency of co-occurrence for two stems.
Throws:
java.io.IOException
See Also:
Correlation_AC(Hashtable, int, int), getTopStemStemVector_MC(String, ArrayList, ArrayList, int[]), Tools.queryExpansionResult(String, ArrayList, ArrayList, float[], String)

Correlation_AC

private int Correlation_AC(java.util.Hashtable docStemMatrix,
                           int u,
                           int v)
                    throws java.io.IOException
This method calculates C(u, v), the so-called "unnormalized association correlation factor" based on the algorithm "Association Clustering".

Given the u-th stem and the v-th stem in the "stem-stem-matrix", the formula is: C(u, v) = f(1, u) x f(1, v) + f(2, u) x f(2, v) + ... + f(n, u) x f(n, v), where f(i, j) is the frequency of the j-th stem in the i-th document and n is the total number of documents.
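
The sum above can be sketched against a hashtable keyed by "x:y" strings. This is an illustrative sketch only: the class AssociationCorrelationSketch, its method names, the explicit numDocs parameter, and the 0-based document and stem indices are assumptions.

```java
import java.util.Hashtable;

/** Hypothetical sketch of C(u, v) over a sparse document-stem-matrix (not the original code). */
final class AssociationCorrelationSketch {

    /** f(doc, stem): the stored frequency, or 0 when the "doc:stem" key is absent. */
    static int frequency(Hashtable<String, Integer> docStemMatrix, int doc, int stem) {
        Integer f = docStemMatrix.get(doc + ":" + stem);
        return f == null ? 0 : f;
    }

    /** C(u, v): sum over all documents of f(doc, u) x f(doc, v). */
    static int correlation(Hashtable<String, Integer> docStemMatrix, int numDocs, int u, int v) {
        int sum = 0;
        for (int doc = 0; doc < numDocs; doc++) {
            sum += frequency(docStemMatrix, doc, u) * frequency(docStemMatrix, doc, v);
        }
        return sum;
    }
}
```

A document contributes to the sum only when both stems occur in it, which is why the factor measures co-occurrence.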

Parameters:
docStemMatrix - The so-called "document-stem-matrix".
u - The u-th row in "stem-stem-matrix".
v - The v-th column in "stem-stem-matrix".
Returns:
The value of the "unnormalized correlation factor".
Throws:
java.io.IOException
See Also:
getStemStemVector_AC(Hashtable, String)

getTopStemStemVector_MC

public static float[] getTopStemStemVector_MC(java.lang.String query,
                                              java.util.ArrayList tokenSet,
                                              java.util.ArrayList stemSet,
                                              int[] topStemsPosition)
                                       throws java.io.IOException
This method calculates only part of the "stem-stem-vector" for a given stemmed query in the normalized "stem-stem-matrix" based on the algorithm "Metric Clustering".

In the "stem-stem-matrix", each element S(u, v) is the so-called "normalized metric correlation factor" for two stems. The formula is: S(u, v) = C(u, v) / (|V(Su)| x |V(Sv)|), where C(u, v) is the so-called "unnormalized metric correlation factor" for two stems and |V(Su)| is the size of the word set belonging to stem u. C(u, v) is calculated from the distance between the two stems in a document.

Not all the values in this vector are calculated, but only those for the stems with the top values in the "stem-stem-vector" based on the algorithm "Association Clustering". Together, these two factors determine the top stems, which become the expanded stems for the given query.
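
The normalization step can be illustrated with a short sketch. The class NormalizedMetricSketch and its method are hypothetical, and returning 0 when either word set is empty is an assumption.

```java
/** Hypothetical sketch of S(u, v) = C(u, v) / (|V(Su)| x |V(Sv)|) (not the original code). */
final class NormalizedMetricSketch {

    /**
     * Divides the unnormalized metric correlation factor by the product of the
     * two word-set sizes. Returns 0 if either word set is empty (assumption).
     */
    static float normalize(float unnormalized, int sizeU, int sizeV) {
        if (sizeU <= 0 || sizeV <= 0) {
            return 0f;
        }
        return unnormalized / ((float) sizeU * sizeV);
    }
}
```

With C(u, v) = 1.75 and two words under each stem, the normalized factor is 1.75 / (2 x 2) = 0.4375.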

Parameters:
query - The query, an English word; the same word delivered by the "Word" input field.
tokenSet - The set of the tokens found in all the documents.
stemSet - The set of the stems found in all the documents.
topStemsPosition - The positions, within the stem set, of the stems that have the top values in the "stem-stem-vector" based on the algorithm "Association Clustering".
Returns:
A list of the "stem-stem-vector" values based on the algorithm "Metric Clustering", computed for the stems that have the top values in the "stem-stem-vector" based on the algorithm "Association Clustering".
Throws:
java.io.IOException
See Also:
Correlation_MC(ArrayList, int, ArrayList, int), distanceWords_MC(Hashtable, Hashtable), distanceTokens_MC(int[], int[]), getStemStemVector_AC(Hashtable, String), Tools.queryExpansionResult(String, ArrayList, ArrayList, float[], String)

Correlation_MC

private static float Correlation_MC(java.util.ArrayList tokenListu,
                                    int sizeu,
                                    java.util.ArrayList tokenListv,
                                    int sizev)
                             throws java.io.IOException
This method calculates C(u, v), the so-called "unnormalized metric correlation factor" based on the algorithm "Metric Clustering".

Given the u-th stem (with a word set of size m containing the words uw1, uw2, ...) and the v-th stem (with a word set of size n containing the words vw1, vw2, ...) in the "stem-stem-matrix", the formula is: C(u, v) = 1/d(uw1, vw1) + 1/d(uw1, vw2) + 1/d(uw1, vw3) + ... + 1/d(uw2, vw1) + 1/d(uw2, vw2) + 1/d(uw2, vw3) + ... + 1/d(uwi, vwj), where i varies from 1 to m and j varies from 1 to n. d(uwi, vwj) is defined as the distance between the two words in the same document; if one or both of them are absent from a document, then 1/d(uwi, vwj) = 0.

The distance between two words is defined as the average distance between all their tokens in a document. For instance, if a word occurs three times in a document, it has three tokens with three different positions in that document. These words and their positions in a document are stored in a hashtable, with the document index as key and the position list as value.
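
The sum of inverse distances can be sketched as follows, assuming the pairwise word distances have already been computed. The class MetricCorrelationSketch, the distances matrix layout, and the convention that a non-positive entry marks a pair that is absent from the document are all assumptions made for this illustration.

```java
/** Hypothetical sketch of C(u, v) as a sum of inverse word distances (not the original code). */
final class MetricCorrelationSketch {

    /**
     * distances[i][j] holds d(uw_i, vw_j) for the i-th word of stem u and the
     * j-th word of stem v. A non-positive (or NaN) entry is treated as "pair
     * absent" and contributes 0 to the sum (assumption).
     */
    static float correlation(float[][] distances) {
        float sum = 0f;
        for (float[] row : distances) {
            for (float d : row) {
                if (d > 0f) {
                    sum += 1f / d;
                }
            }
        }
        return sum;
    }
}
```

Words that appear close together contribute large terms, so stems whose words tend to be neighbors receive a high factor.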

Parameters:
tokenListu - A list of words for the u-th stem in "stem-stem-matrix".
sizeu - The size of the list above.
tokenListv - A list of words for the v-th stem in "stem-stem-matrix".
sizev - The size of the list above.
Returns:
The value of the "unnormalized correlation factor".
Throws:
java.io.IOException
See Also:
getTopStemStemVector_MC(String, ArrayList, ArrayList, int[]), distanceTokens_MC(int[], int[]), distanceWords_MC(Hashtable, Hashtable), org.apache.lucene.index.TermPositions

distanceWords_MC

private static float distanceWords_MC(java.util.Hashtable docPosu,
                                      java.util.Hashtable docPosv)
This method calculates the distance between two words in a document.

The distance between two words is defined as the average distance between all their tokens in a document. For instance, if a word occurs three times in a document, it has three tokens with three different positions in that document. For a stem, its words and their positions in a document are stored in a hashtable, with the document index as key and the position list as value.

Parameters:
docPosu - The hashtable storing all words and their positions for the u-th stem in "stem-stem-matrix".
docPosv - The hashtable storing all words and their positions for the v-th stem in "stem-stem-matrix".
Returns:
The distance between two words in a document.
See Also:
getTopStemStemVector_MC(String, ArrayList, ArrayList, int[]), distanceTokens_MC(int[], int[]), distanceWords_MC(Hashtable, Hashtable)

distanceTokens_MC

private static float distanceTokens_MC(int[] posListu,
                                       int[] posListv)
This method calculates, for one document, the sum of the distances between all pairs of tokens belonging to two words.

For instance, if word1 occurs 2 times in a document at positions pa, pb, and word2 occurs 3 times at positions px, py, pz, then the sum of the distances between all the tokens is: distance-tokens = |pa - px| + |pa - py| + |pa - pz| + |pb - px| + |pb - py| + |pb - pz|. The distance between word1 and word2 is then: distance-words = distance-tokens / (2 x 3).
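
The worked example above can be sketched in code. The class TokenDistanceSketch and its method names are hypothetical, and rejecting empty position lists with an exception is an assumption made for this illustration.

```java
/** Hypothetical sketch of distance-tokens and distance-words (not the original code). */
final class TokenDistanceSketch {

    /** Sum of |pu - pv| over every pair of token positions of the two words. */
    static float sumOfTokenDistances(int[] posListU, int[] posListV) {
        float sum = 0f;
        for (int pu : posListU) {
            for (int pv : posListV) {
                sum += Math.abs(pu - pv);
            }
        }
        return sum;
    }

    /** distance-words = distance-tokens / (number of tokens of u x number of tokens of v). */
    static float averageDistance(int[] posListU, int[] posListV) {
        if (posListU.length == 0 || posListV.length == 0) {
            // Assumption: a word that is absent from the document has no defined distance.
            throw new IllegalArgumentException("both words need at least one token");
        }
        return sumOfTokenDistances(posListU, posListV) / ((float) posListU.length * posListV.length);
    }
}
```

For positions {1, 5} and {2, 4, 9} the token distances are 1 + 3 + 8 + 3 + 1 + 4 = 20, so the word distance is 20 / (2 x 3).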

Parameters:
posListu - The positions of all the tokens belonging to one word in a document; this word is one of the words sharing the u-th stem in "stem-stem-matrix".
posListv - The positions of all the tokens belonging to one word in a document; this word is one of the words sharing the v-th stem in "stem-stem-matrix".
Returns:
The sum of the distances between all the tokens belonging to two words.
See Also:
getTopStemStemVector_MC(String, ArrayList, ArrayList, int[]), distanceTokens_MC(int[], int[]), distanceWords_MC(Hashtable, Hashtable)