ws.qe
Class Tools

java.lang.Object
  extended by ws.qe.Tools

public class Tools
extends java.lang.Object

This class contains the methods for creating the files and data that support the whole program, for query expansion, and for data exchange with the GUI.

The implementation of this class is supported by the external libraries Jakarta Lucene and Snowball Stemmer for Lucene. For more information, see Jakarta Lucene (javadoc) and Snowball Stemmer for Lucene (download).

Author:
Sinia Zhang

Nested Class Summary
private  class Tools.getDocWeb
          This private class contains the run method for downloading a website.
 
Field Summary
private static int CHECK
          The download and processing of one website is checked 5 times within 5 seconds (one check per second) to see whether it has finished successfully.
static java.lang.String[] ENGLISH_STOPWORDS
          A list of English stop words.
private  int numOfDocs
          The total number of the successfully downloaded websites.
private static int SLEEP
          Wait time of 1000 milliseconds (1 second) between two checks.
private  int status
          The status of a website download and processing run (success, failure because the time limit was exceeded, or Google search fault).
static int TIME_LIMIT
          The time limit (5 seconds) for downloading and processing one website.
static java.lang.String[] WEB_STOPWORDS
          A list of the so-called web-stopwords.
 
Constructor Summary
Tools()
          Initializes a Tools object.
 
Method Summary
static boolean areLetters(java.lang.String str)
          Checks whether a string is composed only of letters (no digits or special symbols).
static void deleteStemInDBList(java.lang.String stem, java.lang.String dbListFile)
          Deletes a stem from the file that registers all the stems in the database, that is, removes it from the database.
static void deleteStemInfoFile(java.lang.String stem)
          Deletes the file that stores, for a stem, the date of the download and analysis (query expansion), the number of websites downloaded, its expanded stems and its token set.
private  void doIndexing(org.apache.lucene.index.IndexWriter writer, java.io.File file)
          Performs the indexing recursively: all files under a directory and its subdirectories are indexed.
static void emptyCreateDirectory(java.lang.String path)
          Empties a directory, if it exists, or creates a new one.
static java.lang.String[] getDataList(java.lang.String dbListFile)
          Reads all the stems registered in the database file so that they can be shown in the "DATABASE" list of the GUI.
static java.lang.String[] getDateWebFromStemInfoFile(java.lang.String stem)
          Gets the date of download and analysis (query expansion) and the number of websites downloaded for a stem.
protected  void getDocStem(java.lang.String inFile, java.lang.String outFile)
          Stems all the tokens (words) in a file and writes the stems to another file.
protected  void getDocText(java.lang.String inFile, java.lang.String outFile)
          Extracts the text from a given HTML file and saves it in a text file.
protected  void getDocToken(java.lang.String inFile, java.lang.String outFile)
          Tokenizes a text and saves the extracted tokens in another file.
private  java.lang.String getNum(int num)
          Converts an int to a zero-padded string, for instance 3 to "003" and 12 to "012".
protected  java.util.ArrayList getSet(java.lang.String indexPath)
          Gets the set of tokens or stems, given an index directory.
static java.lang.String getStem(java.lang.String token)
          Stems a given token/word.
static java.lang.String getStemFile(java.lang.String stem, int top)
          Provides the top expanded stems and their token sets for printing in the output of the GUI.
static java.util.Hashtable getStemTokenTable(java.util.ArrayList tokenSet)
          Converts a given list (a set) of tokens to a hashtable in which each key is a stem and its value is the set of tokens for that stem.
static java.lang.String getTokensFromStemInfoFile(java.lang.String stem)
          Gets the token set for a stem from its info file, which also stores the date of query expansion, the number of downloaded websites and the top expanded stems. The token set is shown as tip text in the "DATABASE" list of the GUI.
static boolean inDBList(java.lang.String query, java.lang.String dbListFile)
          Checks whether a stem is already registered in the database.
 void indexDocs(java.lang.String filePath, java.lang.String indexPath)
          Creates and maintains an index for a directory; all the files under this directory are indexed.
static boolean isStopWord(java.lang.String word)
          Checks whether a token/word is a stopword.
static boolean isWebStopword(java.lang.String word)
          Checks whether a token/word is a web-stopword.
private  int numOfNotZeroElements(float[] stemStemVector)
          Counts the non-zero elements in the "stem-stem-matrix" (association correlation factor) to guarantee that there are enough stems as candidates for the query expansion.
private  void printQeResult(java.lang.String query, java.lang.String[] rankedStemList, java.util.Hashtable stemTokenTable, int numOfExpandedStems, java.lang.String qeResultFile)
          Prints the result of a query expansion and saves it in a file; the result contains the 10 top expanded stems and their token sets.
 void queryExpansionResult(java.lang.String query, java.util.ArrayList tokenSet, java.util.ArrayList stemSet, float[] stemStemVector_AC, java.lang.String dbListFile)
          Gets the results of the query expansion and updates both the database and the output in the GUI.
private  void updateDBListFile(java.lang.String query, java.lang.String dbListFile)
          Updates the file that registers all the stems in the database.
 void updateStemInfoFile(java.lang.String query, int web, java.util.Hashtable stemTokenTable, java.lang.String[] rankedStemList)
          Updates the info file for a stem; this file stores the date of the download and analysis (query expansion), the number of websites downloaded, the expanded stems and the token set.
protected  void UrlWebTextTokenStem(java.lang.String query, int numOfDocuments)
          Gets URLs as Google search results for a query, which is delivered by the "Web" input field, and downloads and processes the websites.
protected  void WebTextTokenStem(java.lang.String[] urlsList, java.lang.String webPath, java.lang.String textPath, java.lang.String tokenPath, java.lang.String stemPath)
          Given a list of URLs, downloads and processes the websites.
private  boolean WebTextTokenStem(java.lang.String urlAddress, int urlNr, java.lang.String webPath, java.lang.String textPath, java.lang.String tokenPath, java.lang.String stemPath)
          Downloads and processes a website.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

TIME_LIMIT

public static final int TIME_LIMIT
The time limit (5 seconds) for downloading and processing one website.

See Also:
Constant Field Values

CHECK

private static final int CHECK
The download and processing of one website is checked 5 times within 5 seconds (one check per second) to see whether it has finished successfully.

See Also:
Constant Field Values

SLEEP

private static final int SLEEP
Wait time of 1000 milliseconds (1 second) between two checks.

See Also:
Constant Field Values

status

private int status
The status of a website download and processing run (success, failure because the time limit was exceeded, or Google search fault).


numOfDocs

private int numOfDocs
The total number of the successfully downloaded websites.


ENGLISH_STOPWORDS

public static final java.lang.String[] ENGLISH_STOPWORDS
A list of English stop words.

This list is the MySQL-4.0.20 stop word list.


WEB_STOPWORDS

public static final java.lang.String[] WEB_STOPWORDS
A list of the so-called web-stopwords.

Constructor Detail

Tools

public Tools()
Initializes a Tools object.

Method Detail

UrlWebTextTokenStem

protected void UrlWebTextTokenStem(java.lang.String query,
                                   int numOfDocuments)
                            throws java.io.IOException,
                                   com.google.soap.search.GoogleSearchFault
Gets URLs as Google search results for a query, which is delivered by the "Web" input field, and downloads and processes the websites.

This method is only to be called in "Module II". The given number of websites is downloaded and processed; downloading and processing each website takes at most 5 seconds.

Parameters:
query - The query, an English word, which is delivered by the "Word" input field.
numOfDocuments - The total number of documents to be downloaded for further analysis.
Throws:
java.io.IOException
com.google.soap.search.GoogleSearchFault
See Also:
WebTextTokenStem(String, int, String, String, String, String)

WebTextTokenStem

private boolean WebTextTokenStem(java.lang.String urlAddress,
                                 int urlNr,
                                 java.lang.String webPath,
                                 java.lang.String textPath,
                                 java.lang.String tokenPath,
                                 java.lang.String stemPath)
                          throws java.io.IOException
Downloads and processes a website.

This method is only to be called in "Module II". It combines the functions of getDocWeb, getDocText(), getDocToken() and getDocStem().

Parameters:
urlAddress - A URL from the Google search results.
urlNr - The index number of the URL.
webPath - The directory in which the downloaded websites are saved.
textPath - The directory in which the text documents are saved.
tokenPath - The directory in which the token documents are saved.
stemPath - The directory in which the stem documents are saved.
Returns:
true if the website is successfully downloaded and processed within the time limit, otherwise false.
Throws:
java.io.IOException
See Also:
Tools.getDocWeb, getDocText(String, String), getDocToken(String, String), getDocStem(String, String)
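
The CHECK and SLEEP fields suggest that a download thread is polled a fixed number of times with a fixed pause in between. The following is a minimal sketch of one plausible reading of that scheme, not the actual implementation. The class TimeLimitSketch and the method runWithChecks are hypothetical names, and the real code may track completion through the status field rather than Thread.isAlive().

```java
final class TimeLimitSketch {

    // Starts the task in a worker thread and polls it "checks" times,
    // sleeping "sleepMillis" between polls (5 and 1000 would mirror CHECK and SLEEP).
    // Returns true if the task finished within the limit, otherwise false.
    static boolean runWithChecks(Runnable task, int checks, long sleepMillis) {
        Thread worker = new Thread(task);
        worker.setDaemon(true); // assumption: a stuck download must not block JVM exit
        worker.start();
        try {
            for (int i = 0; i < checks; i++) {
                Thread.sleep(sleepMillis);
                if (!worker.isAlive()) {
                    return true; // finished in time
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
        }
        worker.interrupt(); // time limit exceeded: ask the worker to stop
        return false;
    }

    public static void main(String[] args) {
        // A task that returns immediately is reported as finished in time.
        System.out.println(runWithChecks(() -> { }, 5, 50));
    }
}
```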

WebTextTokenStem

protected void WebTextTokenStem(java.lang.String[] urlsList,
                                java.lang.String webPath,
                                java.lang.String textPath,
                                java.lang.String tokenPath,
                                java.lang.String stemPath)
                         throws java.io.IOException
Given a list of URLs, downloads and processes the websites.

This method is only to be called in "Module I". Downloading and processing each website takes at most 5 seconds. It combines the functions of getDocWeb, getDocText(), getDocToken() and getDocStem().

Parameters:
urlsList - A list of URLs from the Google search results.
webPath - The directory in which the downloaded websites are saved.
textPath - The directory in which the text documents are saved.
tokenPath - The directory in which the token documents are saved.
stemPath - The directory in which the stem documents are saved.
Throws:
java.io.IOException
See Also:
Tools.getDocWeb, getDocText(String, String), getDocToken(String, String), getDocStem(String, String)

emptyCreateDirectory

public static void emptyCreateDirectory(java.lang.String path)
Empties a directory, if it exists, or creates a new one.

Parameters:
path - The path of this directory.
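
As an illustration of the "empty if it exists, otherwise create" behaviour, here is a minimal sketch using java.nio.file. The class DirectorySketch and the method emptyOrCreate are hypothetical names. The original method takes a String path and predates this API, so it is almost certainly implemented differently.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class DirectorySketch {

    // Removes everything below an existing directory, or creates the directory if it is missing.
    static void emptyOrCreate(Path dir) {
        try {
            if (Files.isDirectory(dir)) {
                List<Path> entries;
                try (Stream<Path> walk = Files.walk(dir)) {
                    // Reverse order lists children before their parent directories.
                    entries = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
                }
                for (Path entry : entries) {
                    if (!entry.equals(dir)) {
                        Files.delete(entry); // keep the directory itself
                    }
                }
            } else {
                Files.createDirectories(dir);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```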

isStopWord

public static boolean isStopWord(java.lang.String word)
Checks whether a token/word is a stopword.

Parameters:
word - A token/word to be checked.
Returns:
true if the word is a stopword, otherwise false.
See Also:
ENGLISH_STOPWORDS

isWebStopword

public static boolean isWebStopword(java.lang.String word)
Checks whether a token/word is a web-stopword.

Parameters:
word - A token/word to be checked.
Returns:
true if the word is a web-stopword, otherwise false.
See Also:
WEB_STOPWORDS

getNum

private java.lang.String getNum(int num)
Converts an int to a zero-padded string, for instance 3 to "003" and 12 to "012".

Parameters:
num - The number to be converted.
Returns:
The string representing the number.
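
The two examples above are consistent with padding to three digits. A minimal sketch under that assumption follows; the class ZeroPadSketch and the method zeroPad are hypothetical stand-ins for the private getNum.

```java
import java.util.Locale;

final class ZeroPadSketch {

    // Pads with leading zeros to a width of 3 (width assumed from the documented examples).
    static String zeroPad(int num) {
        return String.format(Locale.ROOT, "%03d", num);
    }

    public static void main(String[] args) {
        System.out.println(zeroPad(3));  // 003
        System.out.println(zeroPad(12)); // 012
    }
}
```

Numbers with more than three digits are returned unpadded by this sketch.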

getDocText

protected void getDocText(java.lang.String inFile,
                          java.lang.String outFile)
                   throws java.io.IOException
Extracts the text from a given HTML file and saves it in a text file.

Not only the <tags> themselves are removed, but also the content between certain tag pairs, e.g. <script></script> and <style></style>.

Parameters:
inFile - An HTML file in which a downloaded website is saved.
outFile - A file to which the extracted text is written.
Throws:
java.io.IOException
See Also:
WebTextTokenStem(String, int, String, String, String, String), WebTextTokenStem(String[], String, String, String, String)
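
The tag removal described above can be approximated with regular expressions. This is only a sketch: the class HtmlTextSketch and the method extractText are hypothetical, the sketch works on strings instead of files, and a regex approach is an assumption rather than the documented technique.

```java
import java.util.regex.Pattern;

final class HtmlTextSketch {

    // Whole <script>...</script> and <style>...</style> blocks, including their content.
    private static final Pattern SCRIPT_OR_STYLE = Pattern.compile(
            "<(script|style)\\b[^>]*>.*?</\\1\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Any remaining tag.
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    static String extractText(String html) {
        String withoutBlocks = SCRIPT_OR_STYLE.matcher(html).replaceAll(" ");
        String withoutTags = TAG.matcher(withoutBlocks).replaceAll(" ");
        return WHITESPACE.matcher(withoutTags).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        String html = "<html><head><style>p{color:red}</style></head>"
                + "<body><p>Query <b>expansion</b></p></body></html>";
        System.out.println(extractText(html)); // Query expansion
    }
}
```

Character entities such as &amp; and HTML comments are not handled in this sketch.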

getDocToken

protected void getDocToken(java.lang.String inFile,
                           java.lang.String outFile)
                    throws java.io.IOException
Tokenizes a text and saves the extracted tokens in another file.

The method filters the text, removing special symbols, punctuation, digits and stopwords.

Parameters:
inFile - A file containing the text to be tokenized.
outFile - A file to which the extracted tokens are written.
Throws:
java.io.IOException
See Also:
WebTextTokenStem(String, int, String, String, String, String), WebTextTokenStem(String[], String, String, String, String)
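
The filtering step can be pictured as follows. This is a minimal sketch with several assumptions: the class TokenizeSketch, the method tokenize and the four-word stop list are hypothetical, tokens are lower-cased, and anything that is not a letter acts as a separator. The actual method works on files and uses the full ENGLISH_STOPWORDS list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class TokenizeSketch {

    // Tiny illustrative stop list, not the real ENGLISH_STOPWORDS.
    private static final Set<String> SAMPLE_STOPWORDS =
            new HashSet<>(Arrays.asList("a", "and", "of", "the"));

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        // Split on every run of non-letters: digits, punctuation and symbols are dropped.
        for (String piece : text.split("[^\\p{L}]+")) {
            if (piece.isEmpty()) {
                continue; // a leading separator produces an empty first piece
            }
            String token = piece.toLowerCase(Locale.ROOT);
            if (!SAMPLE_STOPWORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("The 3 queries, and the index!")); // [queries, index]
    }
}
```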

getDocStem

protected void getDocStem(java.lang.String inFile,
                          java.lang.String outFile)
                   throws java.io.IOException
Stems all the tokens (words) in a file and writes the stems to another file.

The method relies on the Snowball Stemmer, an improved version of the Porter stemmer known as "Porter2". For more information, see Snowball Stemmer for Lucene (download).

Parameters:
inFile - A file containing the tokens to be stemmed.
outFile - A file to which the stems are written.
Throws:
java.io.IOException
See Also:
WebTextTokenStem(String, int, String, String, String, String), WebTextTokenStem(String[], String, String, String, String)

getStem

public static java.lang.String getStem(java.lang.String token)
                                throws java.io.IOException
Stems a given token/word.

Parameters:
token - A token/word, which is to be stemmed.
Returns:
The stem of the token/word.
Throws:
java.io.IOException

areLetters

public static boolean areLetters(java.lang.String str)
Checks whether a string is composed only of letters (no digits or special symbols).

Parameters:
str - A string to be checked.
Returns:
true if the string contains only letters, otherwise false.
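
A minimal sketch of such a check is shown below. The class LetterCheckSketch is hypothetical, and treating null and the empty string as "not letters" is an assumption that the documentation does not settle.

```java
final class LetterCheckSketch {

    static boolean areLetters(String str) {
        if (str == null || str.isEmpty()) {
            return false; // assumption: nothing to check counts as failure
        }
        for (int i = 0; i < str.length(); i++) {
            if (!Character.isLetter(str.charAt(i))) {
                return false; // digit, punctuation, whitespace or symbol found
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(areLetters("query"));  // true
        System.out.println(areLetters("web2"));   // false
    }
}
```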

indexDocs

public void indexDocs(java.lang.String filePath,
                      java.lang.String indexPath)
Creates and maintains an index for a directory; all the files under this directory are indexed.

Parameters:
filePath - The directory, under which all the files are to be indexed.
indexPath - The directory maintaining the created index.
See Also:
doIndexing(IndexWriter, File), org.apache.lucene.index.IndexWriter

doIndexing

private void doIndexing(org.apache.lucene.index.IndexWriter writer,
                        java.io.File file)
                 throws java.lang.Exception
Performs the indexing recursively: all files under a directory and its subdirectories are indexed.

Parameters:
writer - The IndexWriter.
file - A directory, under which all the files are to be indexed, or a file to be indexed.
Throws:
java.lang.Exception
See Also:
indexDocs(String, String), org.apache.lucene.document.Document

getSet

protected java.util.ArrayList getSet(java.lang.String indexPath)
                              throws java.io.IOException
Gets the set of tokens or stems, given an index directory.

Parameters:
indexPath - The directory maintaining the index of all the documents of tokens or stems.
Returns:
A list (a set) of tokens or stems.
Throws:
java.io.IOException
See Also:
indexDocs(String, String)

getStemTokenTable

public static java.util.Hashtable getStemTokenTable(java.util.ArrayList tokenSet)
                                             throws java.io.IOException
Converts a given list (a set) of tokens to a hashtable in which each key is a stem and its value is the set of tokens for that stem.

Parameters:
tokenSet - The set of tokens found in all the documents.
Returns:
A hashtable, each element contains a stem as key, and a set of its tokens as value.
Throws:
java.io.IOException
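
The grouping described for getStemTokenTable can be sketched as follows. The class StemTokenTableSketch, the method group and the toy stemmer are hypothetical. The toy stemmer only strips a trailing "s" and stands in for the Snowball stemmer, which this sketch does not call.

```java
import java.util.Arrays;
import java.util.Hashtable;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.Function;

final class StemTokenTableSketch {

    // Groups tokens under their stem: key = stem, value = the set of tokens with that stem.
    static Hashtable<String, Set<String>> group(List<String> tokens,
                                                Function<String, String> stemmer) {
        Hashtable<String, Set<String>> table = new Hashtable<>();
        for (String token : tokens) {
            table.computeIfAbsent(stemmer.apply(token), stem -> new TreeSet<>()).add(token);
        }
        return table;
    }

    // Toy stand-in for a real stemmer: strips one trailing "s". NOT the Snowball algorithm.
    static String toyStem(String token) {
        return token.length() > 1 && token.endsWith("s")
                ? token.substring(0, token.length() - 1)
                : token;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("stem", "stems", "token");
        System.out.println(group(tokens, StemTokenTableSketch::toyStem).get("stem")); // [stem, stems]
    }
}
```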

queryExpansionResult

public void queryExpansionResult(java.lang.String query,
                                 java.util.ArrayList tokenSet,
                                 java.util.ArrayList stemSet,
                                 float[] stemStemVector_AC,
                                 java.lang.String dbListFile)
                          throws java.io.IOException
Gets the results of the query expansion and updates both the database and the output in the GUI.

Each stem has a value called the "association correlation factor", which is calculated with the "Association Clustering" algorithm. Only for the 20 stems with the top values is the "metric correlation factor", based on the "Metric Clustering" algorithm, calculated as well. The two factors are added, and the 10 stems with the highest sums are the top expanded stems for the query-stem.

Parameters:
query - The query (word) to be expanded.
tokenSet - The set of tokens found in all the documents.
stemSet - The set of stems found in all the documents.
stemStemVector_AC - The vector for the query-stem from the "stem-stem-matrix", whose elements are the (normalized) "association correlation factors".
dbListFile - A file, which registers all the stems in the database.
Throws:
java.io.IOException
See Also:
MatrixVector.getStemStemVector_AC(Hashtable, String), MatrixVector.getTopStemStemVector_MC(String, ArrayList, ArrayList, int[]), updateDBListFile(String, String), updateStemInfoFile(String, int, Hashtable, String[])
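
The two-stage ranking described above (shortlist by association factor, then re-rank by the sum of both factors) can be sketched like this. The class RankingSketch and its methods are hypothetical. As a simplification, the metric factors are passed in as a ready-made array, whereas the documented method computes them only for the shortlisted stems.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class RankingSketch {

    // Returns up to k of the given indices, ordered by descending score.
    static List<Integer> topIndices(float[] scores, List<Integer> candidates, int k) {
        List<Integer> sorted = new ArrayList<>(candidates);
        sorted.sort(Comparator.comparingDouble((Integer i) -> scores[i]).reversed());
        return new ArrayList<>(sorted.subList(0, Math.min(k, sorted.size())));
    }

    // ac: association factors, mc: metric factors, indexed by stem position.
    // shortlist = 20 and top = 10 would mirror the numbers in the description.
    static List<Integer> rank(float[] ac, float[] mc, int shortlist, int top) {
        List<Integer> all = new ArrayList<>();
        for (int i = 0; i < ac.length; i++) {
            all.add(i);
        }
        List<Integer> candidates = topIndices(ac, all, shortlist);
        float[] sum = new float[ac.length];
        for (int i : candidates) {
            sum[i] = ac[i] + mc[i]; // only shortlisted stems get a combined score
        }
        return topIndices(sum, candidates, top);
    }

    public static void main(String[] args) {
        float[] ac = {0.9f, 0.1f, 0.8f, 0.5f};
        float[] mc = {0.0f, 5.0f, 0.3f, 0.9f};
        System.out.println(rank(ac, mc, 3, 2)); // [3, 2]
    }
}
```

In the usage example, index 1 has the largest metric factor but never reaches the shortlist, so it cannot appear in the result.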

printQeResult

private void printQeResult(java.lang.String query,
                           java.lang.String[] rankedStemList,
                           java.util.Hashtable stemTokenTable,
                           int numOfExpandedStems,
                           java.lang.String qeResultFile)
                    throws java.io.IOException
Prints the result of a query expansion and saves it in a file; the result contains the 10 top expanded stems and their token sets.

Parameters:
query - The query, an English word whose stem is to be expanded.
rankedStemList - The 10 top stems, used as expanded stems for the query-stem.
stemTokenTable - A hashtable, each element contains a stem as key, and a set of its tokens as value.
numOfExpandedStems - The number of top stems (10) used as expanded stems for the query-stem.
qeResultFile - The file in which the results of the query expansion for the query-stem are saved.
Throws:
java.io.IOException

numOfNotZeroElements

private int numOfNotZeroElements(float[] stemStemVector)
Counts the non-zero elements in the "stem-stem-matrix" (association correlation factor) to guarantee that there are enough stems as candidates for the query expansion.

Parameters:
stemStemVector - The "stem-stem-matrix" (association correlation factor).
Returns:
The number of non-zero elements in the vector.
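
A minimal sketch of the count follows; the class NonZeroSketch and the method countNonZero are hypothetical names, and an exact comparison with zero is assumed.

```java
final class NonZeroSketch {

    // Counts the elements that differ from zero.
    static int countNonZero(float[] vector) {
        int count = 0;
        for (float value : vector) {
            if (value != 0.0f) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countNonZero(new float[] {0f, 0.5f, 0f, 1f})); // 2
    }
}
```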

updateDBListFile

private void updateDBListFile(java.lang.String query,
                              java.lang.String dbListFile)
                       throws java.io.IOException
Updates the file that registers all the stems in the database.

Parameters:
query - The query, an English word whose stem is to be expanded.
dbListFile - The file registering all the stems in the database.
Throws:
java.io.IOException
See Also:
deleteStemInDBList(String, String)

updateStemInfoFile

public void updateStemInfoFile(java.lang.String query,
                               int web,
                               java.util.Hashtable stemTokenTable,
                               java.lang.String[] rankedStemList)
                        throws java.io.IOException
Updates the info file for a stem; this file stores the date of the download and analysis (query expansion), the number of websites downloaded, the expanded stems and the token set.

Parameters:
query - The query, an English word whose stem is to be expanded.
web - The number of the websites downloaded for the analysis and query expansion.
stemTokenTable - A hashtable, each element contains a stem as key, and a set of its tokens as value.
rankedStemList - A list of 10 expanded stems for the query-stem.
Throws:
java.io.IOException
See Also:
deleteStemInfoFile(String)

inDBList

public static boolean inDBList(java.lang.String query,
                               java.lang.String dbListFile)
                        throws java.io.IOException
Checks whether a stem is already registered in the database.

Parameters:
query - The query, an English word whose stem is to be expanded.
dbListFile - The file registering all the stems in the database.
Returns:
true if the stem is already registered, otherwise false.
Throws:
java.io.IOException
See Also:
updateDBListFile(String, String), deleteStemInDBList(String, String)

deleteStemInDBList

public static void deleteStemInDBList(java.lang.String stem,
                                      java.lang.String dbListFile)
                               throws java.io.IOException
Deletes a stem from the file that registers all the stems in the database, that is, removes it from the database.

Parameters:
stem - The stem to be deleted.
dbListFile - The file registering all the stems in the database.
Throws:
java.io.IOException
See Also:
updateDBListFile(String, String)

deleteStemInfoFile

public static void deleteStemInfoFile(java.lang.String stem)
Deletes the file that stores, for a stem, the date of the download and analysis (query expansion), the number of websites downloaded, its expanded stems and its token set.

Parameters:
stem - The stem to be deleted.
See Also:
updateStemInfoFile(String, int, Hashtable, String[])

getStemFile

public static java.lang.String getStemFile(java.lang.String stem,
                                           int top)
                                    throws java.io.IOException
Provides the top expanded stems and their token sets for printing in the output of the GUI.

Parameters:
stem - The stem in the database, which is already analyzed and expanded.
top - The number of top expanded stems to be printed, as delivered by the "Top" box of the GUI.
Returns:
The top expanded stems and their token sets to be printed.
Throws:
java.io.IOException
See Also:
ShowStartDialog

getDataList

public static java.lang.String[] getDataList(java.lang.String dbListFile)
                                      throws java.io.IOException
Reads all the stems registered in the database file so that they can be shown in the "DATABASE" list of the GUI.

Parameters:
dbListFile - The file registering all the stems in the database.
Returns:
A list of all the stems in the database.
Throws:
java.io.IOException
See Also:
GuiGeneration.updateDataList(JList)

getTokensFromStemInfoFile

public static java.lang.String getTokensFromStemInfoFile(java.lang.String stem)
                                                  throws java.io.IOException
Gets the token set for a stem from its info file, which also stores the date of query expansion, the number of downloaded websites and the top expanded stems. The token set is shown as tip text in the "DATABASE" list of the GUI.

Parameters:
stem - The stem in the database, which is already analyzed and expanded.
Returns:
The token-set for a stem.
Throws:
java.io.IOException

getDateWebFromStemInfoFile

public static java.lang.String[] getDateWebFromStemInfoFile(java.lang.String stem)
                                                     throws java.io.IOException
Gets the date of download and analysis (query expansion) and the number of websites downloaded for a stem.

Parameters:
stem - The stem in the database, which is already analyzed and expanded.
Returns:
The date and the number of downloaded websites.
Throws:
java.io.IOException
See Also:
ShowStartDialog