java.lang.Object
  ws.qe.Tools
This class contains the methods for creating the files and data needed to
support the whole program, for query expansion, and for data exchange in the
GUI.
The implementation of this class is supported by the external libraries Jakarta
Lucene and Snowball Stemmer for Lucene. For more information, see Jakarta
Lucene (javadoc) and Snowball Stemmer for Lucene (download).
Nested Class Summary | |
private class |
Tools.getDocWeb
This private class contains the run method for downloading
a website. |
Field Summary | |
private static int |
CHECK
The downloading and processing of one website is checked 5 times (one check per second) within 5 seconds to see whether it has ended successfully. |
static java.lang.String[] |
ENGLISH_STOPWORDS
A list of English stop words. |
private int |
numOfDocs
The total number of the successfully downloaded websites. |
private static int |
SLEEP
A wait time of 1000 milliseconds (1 second) between two checks. |
private int |
status
The status (success, failure because the time limit ran out, or Google search fault) of downloading and processing a website. |
static int |
TIME_LIMIT
The time limit (5 seconds) for downloading and processing one website. |
static java.lang.String[] |
WEB_STOPWORDS
A list of the so-called web-stopwords. |
Constructor Summary | |
Tools()
Initializes a Tools object. |
Method Summary | |
static boolean |
areLetters(java.lang.String str)
Checks whether a string is composed only of letters (without digits or special symbols). |
static void |
deleteStemInDBList(java.lang.String stem,
java.lang.String dbListFile)
Deletes a stem from the file that registers all the stems in the database (that is, from the database). |
static void |
deleteStemInfoFile(java.lang.String stem)
Deletes the file that stores, for a stem, the date of the download and analysis (query expansion), the number of websites downloaded, its expanded stems and its token-set. |
private void |
doIndexing(org.apache.lucene.index.IndexWriter writer,
java.io.File file)
This method does the indexing recursively: all the files under a directory and its sub-directories are indexed. |
static void |
emptyCreateDirectory(java.lang.String path)
Empties a directory, if it exists, or creates a new one. |
static java.lang.String[] |
getDataList(java.lang.String dbListFile)
Prints all the stems of the database, as registered in a file, into the "DATABASE"-list of the GUI. |
static java.lang.String[] |
getDateWebFromStemInfoFile(java.lang.String stem)
Gets the date of the download and analysis (query expansion) and the number of websites downloaded for a stem. |
protected void |
getDocStem(java.lang.String inFile,
java.lang.String outFile)
This method stems all the tokens (words) in a file and writes the stems to another file. |
protected void |
getDocText(java.lang.String inFile,
java.lang.String outFile)
This method extracts the text from a given HTML file and saves it in a txt-file. |
protected void |
getDocToken(java.lang.String inFile,
java.lang.String outFile)
This method tokenizes a text, extracts the tokens from it, and saves them in another file. |
private java.lang.String |
getNum(int num)
Converts an int number to a string, for instance 3 to "003" and 12 to "012". |
protected java.util.ArrayList |
getSet(java.lang.String indexPath)
This method gets a set of tokens or stems, given an index-directory. |
static java.lang.String |
getStem(java.lang.String token)
This method stems a given token/word. |
static java.lang.String |
getStemFile(java.lang.String stem,
int top)
This method prints the top expanded stems and their token-sets in the output of the GUI. |
static java.util.Hashtable |
getStemTokenTable(java.util.ArrayList tokenSet)
This method converts a given list (a set) of tokens to a hashtable: each key is a stem, and its value is the set of tokens for this stem. |
static java.lang.String |
getTokensFromStemInfoFile(java.lang.String stem)
Gets the token-set for a stem from the file that also stores the date of query expansion, the number of downloaded websites and the top expanded stems. This token-set is shown in the "DATABASE"-list (tip-text) of the GUI. |
static boolean |
inDBList(java.lang.String query,
java.lang.String dbListFile)
Checks whether a stem is already registered in the database. |
void |
indexDocs(java.lang.String filePath,
java.lang.String indexPath)
This method creates and maintains an index for a directory; all the files under this directory are indexed. |
static boolean |
isStopWord(java.lang.String word)
Checks whether a token/word is a stopword. |
static boolean |
isWebStopword(java.lang.String word)
Checks whether a token/word is a web-stopword. |
private int |
numOfNotZeroElements(float[] stemStemVector)
Counts the non-zero elements in the "stem-stem-matrix" (association correlation factor) to guarantee that there are enough stems as candidates for the query expansion. |
private void |
printQeResult(java.lang.String query,
java.lang.String[] rankedStemList,
java.util.Hashtable stemTokenTable,
int numOfExpandedStems,
java.lang.String qeResultFile)
This method prints a result of query expansion and saves it in a file; the result contains the 10 top expanded stems and their token-sets. |
void |
queryExpansionResult(java.lang.String query,
java.util.ArrayList tokenSet,
java.util.ArrayList stemSet,
float[] stemStemVector_AC,
java.lang.String dbListFile)
This method gets the results of the query expansion, and updates the database and the output in the GUI. |
private void |
updateDBListFile(java.lang.String query,
java.lang.String dbListFile)
Updates the file that registers all the stems in the database. |
void |
updateStemInfoFile(java.lang.String query,
int web,
java.util.Hashtable stemTokenTable,
java.lang.String[] rankedStemList)
Updates the file for a stem. This file stores the date of the download and analysis (query expansion), the number of websites downloaded, the expanded stems of the stem and its token-set. |
protected void |
UrlWebTextTokenStem(java.lang.String query,
int numOfDocuments)
This method gets URLs as Google search results for a query, which is delivered by the "Web"-inputfield, then downloads and processes the websites. |
protected void |
WebTextTokenStem(java.lang.String[] urlsList,
java.lang.String webPath,
java.lang.String textPath,
java.lang.String tokenPath,
java.lang.String stemPath)
This method, given a list of URLs, downloads and processes the websites. |
private boolean |
WebTextTokenStem(java.lang.String urlAddress,
int urlNr,
java.lang.String webPath,
java.lang.String textPath,
java.lang.String tokenPath,
java.lang.String stemPath)
This method downloads and processes a website. |
Methods inherited from class java.lang.Object |
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
public static final int TIME_LIMIT
private static final int CHECK
private static final int SLEEP
private int status
private int numOfDocs
public static final java.lang.String[] ENGLISH_STOPWORDS
public static final java.lang.String[] WEB_STOPWORDS
Constructor Detail |
public Tools()
Method Detail |
protected void UrlWebTextTokenStem(java.lang.String query, int numOfDocuments) throws java.io.IOException, com.google.soap.search.GoogleSearchFault
Parameters:
query - The query, an English word, which is delivered by the "Word"-inputfield.
numOfDocuments - The total number of documents finally downloaded for the further analysis.
Throws:
java.io.IOException
com.google.soap.search.GoogleSearchFault
See Also:
WebTextTokenStem(String, int, String, String, String, String)
private boolean WebTextTokenStem(java.lang.String urlAddress, int urlNr, java.lang.String webPath, java.lang.String textPath, java.lang.String tokenPath, java.lang.String stemPath) throws java.io.IOException
This method downloads and processes a website, using getDocWeb, getDocText(), getDocToken() and getDocStem() together.
Parameters:
urlAddress - A URL as Google search result.
urlNr - The index number of a URL.
webPath - A directory saving the downloaded websites.
textPath - A directory saving the documents of texts.
tokenPath - A directory saving the documents of tokens.
stemPath - A directory saving the documents of stems.
Returns:
true if a website is successfully downloaded and processed within the time limit, otherwise false.
Throws:
java.io.IOException
See Also:
Tools.getDocWeb, getDocText(String, String), getDocToken(String, String), getDocStem(String, String)
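The CHECK, SLEEP and TIME_LIMIT fields suggest that the download runs in its own thread and is polled once per second until it finishes or the limit is reached. The following is a minimal sketch of such a polling loop, not the actual ws.qe.Tools code: the class name PollingSketch, the method runWithTimeLimit and the use of a plain Runnable in place of Tools.getDocWeb are all assumptions made for illustration.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch; names and structure are assumptions, not the real Tools code.
final class PollingSketch {

    /**
     * Runs the task in a background thread and checks up to `checks` times,
     * sleeping `sleepMillis` before each check, whether it has finished.
     * Returns true if the task finished within the limit, otherwise false.
     */
    static boolean runWithTimeLimit(Runnable task, int checks, long sleepMillis)
            throws InterruptedException {
        AtomicBoolean done = new AtomicBoolean(false);
        Thread worker = new Thread(() -> {
            task.run();
            done.set(true);
        });
        worker.setDaemon(true); // a stuck download must not keep the JVM alive
        worker.start();

        for (int i = 0; i < checks; i++) {
            Thread.sleep(sleepMillis);
            if (done.get()) {
                return true;
            }
        }
        worker.interrupt(); // ask a task that overran the limit to stop
        return false;
    }

    public static void main(String[] args) throws InterruptedException {
        // 5 checks, 1000 ms apart, mirror the documented CHECK and SLEEP values.
        boolean ok = runWithTimeLimit(() -> { /* the download would go here */ }, 5, 1000);
        System.out.println(ok);
    }
}
```

With 5 checks and a 1000 ms sleep, the loop gives up after roughly 5 seconds, which matches the documented TIME_LIMIT.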
protected void WebTextTokenStem(java.lang.String[] urlsList, java.lang.String webPath, java.lang.String textPath, java.lang.String tokenPath, java.lang.String stemPath) throws java.io.IOException
This method, given a list of URLs, downloads and processes the websites, using getDocWeb, getDocText(), getDocToken() and getDocStem() together.
Parameters:
urlsList - A list of URLs as Google search results.
webPath - A directory saving the downloaded websites.
textPath - A directory saving the documents of texts.
tokenPath - A directory saving the documents of tokens.
stemPath - A directory saving the documents of stems.
Throws:
java.io.IOException
See Also:
Tools.getDocWeb, getDocText(String, String), getDocToken(String, String), getDocStem(String, String)
public static void emptyCreateDirectory(java.lang.String path)
Parameters:
path - The path of this directory.
public static boolean isStopWord(java.lang.String word)
Parameters:
word - A token/word to be checked.
Returns:
true or false.
See Also:
ENGLISH_STOPWORDS
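A stop-word check of this kind amounts to a membership test against a fixed word list. The sketch below is an illustration only: the class name StopWordSketch and the sample entries are hypothetical (the real content of ENGLISH_STOPWORDS is not shown on this page), and the case-insensitive comparison is an assumption.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch of a stop-word lookup; not the actual Tools.isStopWord code.
final class StopWordSketch {

    // Illustrative entries only; the real ENGLISH_STOPWORDS list may differ.
    static final String[] SAMPLE_STOPWORDS = {"a", "an", "and", "the", "of", "to"};

    // A hash set gives constant-time lookups instead of scanning the array.
    private static final Set<String> LOOKUP =
            new HashSet<>(Arrays.asList(SAMPLE_STOPWORDS));

    /** Returns true if the word is in the sample list (assumed case-insensitive). */
    static boolean isStopWord(String word) {
        return word != null && LOOKUP.contains(word.toLowerCase(Locale.ROOT));
    }
}
```

The same pattern would apply to a second list such as WEB_STOPWORDS.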
public static boolean isWebStopword(java.lang.String word)
Parameters:
word - A token/word to be checked.
Returns:
true or false.
See Also:
WEB_STOPWORDS
private java.lang.String getNum(int num)
Parameters:
num - The number to be converted.
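The conversion described for getNum (3 to "003", 12 to "012") is a zero-padding to three digits. A minimal sketch, assuming a hypothetical class name ZeroPadSketch and using String.format; how the real method is implemented is not shown on this page.

```java
import java.util.Locale;

// Hypothetical sketch of three-digit zero-padding; not the actual Tools.getNum code.
final class ZeroPadSketch {

    /** Pads a non-negative number with leading zeros to at least three digits. */
    static String getNum(int num) {
        // Locale.ROOT keeps the digits ASCII regardless of the default locale.
        return String.format(Locale.ROOT, "%03d", num);
    }
}
```

Numbers with more than three digits are returned unpadded.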
protected void getDocText(java.lang.String inFile, java.lang.String outFile) throws java.io.IOException
Parameters:
inFile - A file in HTML format, in which a downloaded website is saved.
outFile - A file, in which the extracted texts are written.
Throws:
java.io.IOException
See Also:
WebTextTokenStem(String, int, String, String, String, String), WebTextTokenStem(String[], String, String, String, String)
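Extracting the text of an HTML document can be sketched with the HTML parser that ships with the JDK. This is an illustration under stated assumptions, not the method's actual implementation (which is not shown on this page): the class name HtmlTextSketch and the method extractText are hypothetical, and reading inFile and writing outFile are left to the caller.

```java
import java.io.IOException;
import java.io.Reader;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical sketch using the JDK's built-in HTML parser.
final class HtmlTextSketch {

    /** Collects the text chunks reported by the parser, separated by spaces. */
    static String extractText(Reader html) throws IOException {
        StringBuilder text = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                text.append(data).append(' ');
            }
        };
        // The third argument tells the parser to ignore charset declarations.
        new ParserDelegator().parse(html, callback, true);
        return text.toString().trim();
    }
}
```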
protected void getDocToken(java.lang.String inFile, java.lang.String outFile) throws java.io.IOException
Parameters:
inFile - A file, which contains a text to be tokenized.
outFile - A file, in which the extracted tokens are to be written.
Throws:
java.io.IOException
See Also:
WebTextTokenStem(String, int, String, String, String, String), WebTextTokenStem(String[], String, String, String, String)
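Tokenizing can be sketched as splitting a text on everything that is not a letter. This is a simplified stand-in: the real method presumably relies on a Lucene analyzer, and the class name TokenizeSketch, the lower-casing and the letters-only rule are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch of a letters-only tokenizer; not the actual Tools.getDocToken code.
final class TokenizeSketch {

    /** Splits the text on runs of non-letter characters and lower-cases the tokens. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.split("[^\\p{L}]+")) {
            if (!part.isEmpty()) { // a leading separator yields one empty string
                tokens.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }
}
```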
protected void getDocStem(java.lang.String inFile, java.lang.String outFile) throws java.io.IOException
Parameters:
inFile - A file, in which the tokens are to be stemmed.
outFile - A file, in which the stems are to be written.
Throws:
java.io.IOException
See Also:
WebTextTokenStem(String, int, String, String, String, String), WebTextTokenStem(String[], String, String, String, String)
public static java.lang.String getStem(java.lang.String token) throws java.io.IOException
Parameters:
token - A token/word, which is to be stemmed.
Throws:
java.io.IOException
public static boolean areLetters(java.lang.String str)
Parameters:
str - A string to be checked.
Returns:
true or false.
public void indexDocs(java.lang.String filePath, java.lang.String indexPath)
Parameters:
filePath - The directory, under which all the files are to be indexed.
indexPath - The directory maintaining the created index.
See Also:
doIndexing(IndexWriter, File), org.apache.lucene.index.IndexWriter
private void doIndexing(org.apache.lucene.index.IndexWriter writer, java.io.File file) throws java.lang.Exception
Parameters:
writer - The IndexWriter.
file - A directory, under which all the files are to be indexed, or a file to be indexed.
Throws:
java.lang.Exception
See Also:
indexDocs(String, String), org.apache.lucene.document.Document
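The recursive traversal behind doIndexing can be sketched without Lucene: a directory is descended into, a plain file is handed on for indexing. In this sketch the class name RecursiveWalkSketch and the method collect are hypothetical, and adding the file to a list stands in for adding a Lucene Document to the IndexWriter.

```java
import java.io.File;
import java.util.List;

// Hypothetical sketch of the recursion only; no Lucene calls are made here.
final class RecursiveWalkSketch {

    /** Visits a file, or every file under a directory and its sub-directories. */
    static void collect(File file, List<File> out) {
        if (file.isDirectory()) {
            File[] children = file.listFiles(); // null if the directory is unreadable
            if (children != null) {
                for (File child : children) {
                    collect(child, out);
                }
            }
        } else if (file.isFile()) {
            out.add(file); // the real method would index the file at this point
        }
    }
}
```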
protected java.util.ArrayList getSet(java.lang.String indexPath) throws java.io.IOException
Parameters:
indexPath - The directory maintaining the index of all the documents of tokens or stems.
Throws:
java.io.IOException
See Also:
indexDocs(String, String)
public static java.util.Hashtable getStemTokenTable(java.util.ArrayList tokenSet) throws java.io.IOException
Parameters:
tokenSet - The set of tokens found in all the documents.
Throws:
java.io.IOException
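The stem-to-tokens table can be pictured as grouping tokens under their stem. The sketch below is an assumption-laden illustration: the class name StemTokenTableSketch is hypothetical, the stemmer is passed in as a function, and toyStem is a deliberately naive stand-in for the Snowball stemmer that the real class uses.

```java
import java.util.Hashtable;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.Function;

// Hypothetical sketch; not the actual Tools.getStemTokenTable code.
final class StemTokenTableSketch {

    /** Groups tokens by stem: each key is a stem, each value the set of its tokens. */
    static Hashtable<String, Set<String>> build(List<String> tokens,
                                                Function<String, String> stemmer) {
        Hashtable<String, Set<String>> table = new Hashtable<>();
        for (String token : tokens) {
            table.computeIfAbsent(stemmer.apply(token), k -> new TreeSet<>()).add(token);
        }
        return table;
    }

    /** Toy stand-in for a real stemmer: strips one trailing "s". */
    static String toyStem(String token) {
        return token.endsWith("s") ? token.substring(0, token.length() - 1) : token;
    }
}
```

A call such as `StemTokenTableSketch.build(tokens, StemTokenTableSketch::toyStem)` groups "cat" and "cats" under the key "cat".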
public void queryExpansionResult(java.lang.String query, java.util.ArrayList tokenSet, java.util.ArrayList stemSet, float[] stemStemVector_AC, java.lang.String dbListFile) throws java.io.IOException
Parameters:
query - The query (word) to be expanded.
tokenSet - The set of tokens found in all the documents.
stemSet - The set of stems found in all the documents.
stemStemVector_AC - The vector for the query-stem from the "stem-stem-matrix", whose elements are called (normalized) "association correlation factor".
dbListFile - A file, which registers all the stems in the database.
Throws:
java.io.IOException
See Also:
MatrixVector.getStemStemVector_AC(Hashtable, String), MatrixVector.getTopStemStemVector_MC(String, ArrayList, ArrayList, int[]), updateDBListFile(String, String), updateStemInfoFile(String, int, Hashtable, String[])
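One plausible way to turn the query-stem vector into a ranked list of expanded stems is to sort the stems by their factor and keep the best ones. This sketch rests on assumptions that this page does not confirm: that element i of the vector belongs to stem i of the stem set, that zero entries are not candidates, and that a higher factor ranks first. The class name RankStemsSketch and the method topStems are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical ranking sketch; the real ranking logic is not shown on this page.
final class RankStemsSketch {

    /** Returns up to `top` stems with non-zero scores, highest score first. */
    static List<String> topStems(List<String> stems, float[] scores, int top) {
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < scores.length && i < stems.size(); i++) {
            if (scores[i] != 0f) {
                order.add(i); // only non-zero entries are candidates
            }
        }
        order.sort((a, b) -> Float.compare(scores[b], scores[a])); // descending
        List<String> result = new ArrayList<>();
        for (int i = 0; i < order.size() && i < top; i++) {
            result.add(stems.get(order.get(i)));
        }
        return result;
    }
}
```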
private void printQeResult(java.lang.String query, java.lang.String[] rankedStemList, java.util.Hashtable stemTokenTable, int numOfExpandedStems, java.lang.String qeResultFile) throws java.io.IOException
Parameters:
query - A query, an English word, whose stem is to be expanded.
rankedStemList - The 10 top stems as expanded stems for the query-stem.
stemTokenTable - A hashtable in which each element contains a stem as key and a set of its tokens as value.
numOfExpandedStems - The number of top stems (10) used as expanded stems for the query-stem.
qeResultFile - The file saving the results of query expansion for the query-stem.
Throws:
java.io.IOException
private int numOfNotZeroElements(float[] stemStemVector)
Parameters:
stemStemVector - The "stem-stem-matrix" (association correlation factor).
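Counting the non-zero elements is a single pass over the vector. A minimal sketch, assuming a hypothetical class name NonZeroCountSketch; the threshold that counts as "enough" candidates is not stated on this page and is therefore left out.

```java
// Hypothetical sketch; not the actual Tools.numOfNotZeroElements code.
final class NonZeroCountSketch {

    /** Returns how many elements of the vector differ from zero. */
    static int numOfNotZeroElements(float[] vector) {
        int count = 0;
        for (float value : vector) {
            if (value != 0f) {
                count++;
            }
        }
        return count;
    }
}
```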
private void updateDBListFile(java.lang.String query, java.lang.String dbListFile) throws java.io.IOException
Parameters:
query - The query, an English word, whose stem is to be expanded.
dbListFile - The file registering all the stems in the database.
Throws:
java.io.IOException
See Also:
deleteStemInDBList(String, String)
public void updateStemInfoFile(java.lang.String query, int web, java.util.Hashtable stemTokenTable, java.lang.String[] rankedStemList) throws java.io.IOException
Parameters:
query - The query, an English word, whose stem is to be expanded.
web - The number of the websites downloaded for the analysis and query expansion.
stemTokenTable - A hashtable in which each element contains a stem as key and a set of its tokens as value.
rankedStemList - A list of 10 expanded stems for the query-stem.
Throws:
java.io.IOException
See Also:
deleteStemInfoFile(String)
public static boolean inDBList(java.lang.String query, java.lang.String dbListFile) throws java.io.IOException
Parameters:
query - The query, an English word, whose stem is to be expanded.
dbListFile - The file registering all the stems in the database.
Returns:
true or false.
Throws:
java.io.IOException
See Also:
updateDBListFile(String, String), deleteStemInDBList(String, String)
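A membership check against the list file can be sketched as a line-by-line comparison. The file format is not documented on this page, so the sketch assumes one stem per line in UTF-8; the class name DbListLookupSketch and the use of java.nio.file.Path instead of a String path are likewise assumptions.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch; assumes one stem per line in the list file.
final class DbListLookupSketch {

    /** Returns true if the stem appears on a line of its own in the list file. */
    static boolean inDBList(String stem, Path dbListFile) throws IOException {
        if (!Files.exists(dbListFile)) {
            return false; // no list file yet means nothing is registered
        }
        for (String line : Files.readAllLines(dbListFile, StandardCharsets.UTF_8)) {
            if (line.trim().equals(stem)) {
                return true;
            }
        }
        return false;
    }
}
```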
public static void deleteStemInDBList(java.lang.String stem, java.lang.String dbListFile) throws java.io.IOException
Parameters:
stem - The stem to be deleted.
dbListFile - The file registering all the stems in the database.
Throws:
java.io.IOException
See Also:
updateDBListFile(String, String)
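Deleting a stem from the list file can be sketched as rewriting the file without the matching line. As before, the one-stem-per-line UTF-8 format is an assumption, and the class name DbListDeleteSketch and the method deleteStem are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; assumes one stem per line in the list file.
final class DbListDeleteSketch {

    /** Rewrites the list file, keeping every line except those equal to the stem. */
    static void deleteStem(String stem, Path dbListFile) throws IOException {
        List<String> kept = new ArrayList<>();
        for (String line : Files.readAllLines(dbListFile, StandardCharsets.UTF_8)) {
            if (!line.trim().equals(stem)) {
                kept.add(line);
            }
        }
        // Files.write truncates the existing file before writing the kept lines.
        Files.write(dbListFile, kept, StandardCharsets.UTF_8);
    }
}
```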
public static void deleteStemInfoFile(java.lang.String stem)
Parameters:
stem - The stem to be deleted.
See Also:
updateStemInfoFile(String, int, Hashtable, String[])
public static java.lang.String getStemFile(java.lang.String stem, int top) throws java.io.IOException
Parameters:
stem - The stem in the database, which is already analyzed and expanded.
top - The number of top expanded stems to be printed; the number is delivered by the "Top"-box of the GUI.
Throws:
java.io.IOException
See Also:
ShowStartDialog
public static java.lang.String[] getDataList(java.lang.String dbListFile) throws java.io.IOException
Parameters:
dbListFile - The file registering all the stems in the database.
Throws:
java.io.IOException
See Also:
GuiGeneration.updateDataList(JList)
public static java.lang.String getTokensFromStemInfoFile(java.lang.String stem) throws java.io.IOException
Parameters:
stem - The stem in the database, which is already analyzed and expanded.
Throws:
java.io.IOException
public static java.lang.String[] getDateWebFromStemInfoFile(java.lang.String stem) throws java.io.IOException
Parameters:
stem - The stem in the database, which is already analyzed and expanded.
Throws:
java.io.IOException
See Also:
ShowStartDialog