Introduction to Topic Models

LDA model implementation


Step 1

Deadline: May 17th, 2012 -- to be sent to me by email

Implement a module that reads a collection of texts into data structures that allow us to iterate through these texts word by word (possibly also sentence by sentence or paragraph by paragraph), and compute various statistics for each word.

For each position i (in each sentence/paragraph s) in each document d, your system should keep track of:

Your module should also have a data structure that represents the vocabulary of the text collection. We will need this to iterate over and compute word occurrence statistics.
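The reader module described above could be sketched as follows. This is a minimal illustration assuming plain-text documents and naive whitespace tokenization; the class and method names (`Corpus`, `add_document`, `iter_words`) are illustrative, not required.

```python
from collections import Counter

class Corpus:
    """Holds a collection of texts and its vocabulary (illustrative sketch)."""

    def __init__(self):
        self.documents = []           # list of documents; each is a list of tokens
        self.vocabulary = {}          # word -> integer id
        self.word_counts = Counter()  # word -> total occurrences in the collection

    def add_document(self, text):
        # Naive tokenizer: lowercase and split on whitespace; replace as needed.
        tokens = text.lower().split()
        for w in tokens:
            if w not in self.vocabulary:
                self.vocabulary[w] = len(self.vocabulary)
            self.word_counts[w] += 1
        self.documents.append(tokens)

    def iter_words(self):
        """Iterate through the texts word by word, yielding (doc index, position, word)."""
        for d, doc in enumerate(self.documents):
            for i, w in enumerate(doc):
                yield d, i, w
```

Iterating sentence by sentence or paragraph by paragraph would only require splitting on sentence/paragraph boundaries before tokenizing.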


Step 2

Deadline: May 31st, 2012 -- to be sent to me by email

Add a parameter K to your system -- the number of topics in the data.

Implement a function/method/subroutine that assigns a random integer between 1 and K (or 0 and K-1) to each position (token) in a document in the collection (if a document has length 100, you will assign 100 random integers). These integers represent the topics in our document collection.
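The random initialization above could look like the following sketch. It assumes topics are numbered 0 to K-1 and documents are lists of tokens; the function name and the fixed seed are illustrative.

```python
import random

def init_topic_assignments(documents, K, seed=0):
    """Assign a random topic in [0, K) to every token position in every document.

    A document of length 100 receives 100 random integers, as described above.
    """
    rng = random.Random(seed)  # fixed seed only for reproducibility while debugging
    return [[rng.randrange(K) for _ in doc] for doc in documents]
```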

For each word w in the vocabulary of the document collection, compute the following (keep all these values in different data structures):

For each topic k (k between 1 and K (or 0 and K-1)), compute the following counts (keep them in different data structures):
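The exact per-word and per-topic counts are left to specify above; in a standard collapsed-Gibbs implementation of LDA they are typically n_dk (number of tokens in document d assigned to topic k), n_kw (number of times word w is assigned to topic k), and n_k (total number of tokens assigned to topic k). A hedged sketch of building these structures from the random assignments:

```python
def build_counts(documents, assignments, vocabulary, K):
    """Build the standard LDA count structures n_dk, n_kw, n_k.

    Assumes `documents` is a list of token lists, `assignments` the parallel
    list of topic ids, and `vocabulary` a word -> integer id mapping.
    """
    D, V = len(documents), len(vocabulary)
    n_dk = [[0] * K for _ in range(D)]  # tokens in document d with topic k
    n_kw = [[0] * V for _ in range(K)]  # times word w is assigned topic k
    n_k = [0] * K                       # total tokens assigned topic k
    for d, doc in enumerate(documents):
        for i, word in enumerate(doc):
            k = assignments[d][i]
            w = vocabulary[word]
            n_dk[d][k] += 1
            n_kw[k][w] += 1
            n_k[k] += 1
    return n_dk, n_kw, n_k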

Step 3

Deadline: June 14th, 2012 -- to be sent to me by email

Separate your data into training (90%) and testing (10%) portions.

Set all elements of the alpha vector to the same value, 50/K, and all elements of the beta vector to 0.01.

Until convergence, or for a fixed number of iterations (N = 1000 or more), repeat the sampling step (one iteration is one full pass over the training portion of the document collection!):
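One pass of the standard collapsed Gibbs sampler for LDA could be sketched as below. For each token it removes the current assignment from the counts, samples a new topic from the conditional p(z_i = k | rest) ∝ (n_dk + alpha) * (n_kw + beta) / (n_k + V*beta), and restores the counts. The count structures n_dk, n_kw, n_k are assumed to match the ones built in Step 2; all names are illustrative.

```python
import random

def gibbs_pass(documents, vocabulary, z, n_dk, n_kw, n_k, alpha, beta, rng):
    """One full pass over the training documents, resampling each token's topic."""
    K = len(n_k)
    V = len(vocabulary)
    for d, doc in enumerate(documents):
        for i, word in enumerate(doc):
            w = vocabulary[word]
            k = z[d][i]
            # Remove this token's current assignment from the counts.
            n_dk[d][k] -= 1; n_kw[k][w] -= 1; n_k[k] -= 1
            # Unnormalized conditional for each candidate topic t.
            weights = [(n_dk[d][t] + alpha) * (n_kw[t][w] + beta) / (n_k[t] + V * beta)
                       for t in range(K)]
            k = rng.choices(range(K), weights=weights)[0]
            # Record the new assignment and restore the counts.
            z[d][i] = k
            n_dk[d][k] += 1; n_kw[k][w] += 1; n_k[k] += 1
```

After the final iteration, the distributions can be estimated from the counts in the usual way, e.g. theta_dk proportional to n_dk + alpha and phi_kw proportional to n_kw + beta.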

Step 4

Deadline: June 28th, 2012 -- to be sent to me by email

Train the model for different values of alpha and beta, and plot the perplexity of the test data based on the computed distributions.
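One common way to compute test-set perplexity is exp of the negative average log-likelihood per token, with each token's probability given by the mixture sum over topics of theta_dk * phi_kw. A sketch, assuming theta (D x K) and phi (K x V) have been estimated from the trained model (how theta is obtained for unseen test documents is a modelling choice left open here):

```python
import math

def perplexity(test_docs, vocabulary, theta, phi):
    """Perplexity of held-out documents: exp(-log-likelihood / token count)."""
    log_likelihood, n_tokens = 0.0, 0
    for d, doc in enumerate(test_docs):
        for word in doc:
            w = vocabulary[word]
            # Marginal probability of this token under the topic mixture.
            p = sum(theta[d][k] * phi[k][w] for k in range(len(phi)))
            log_likelihood += math.log(p)
            n_tokens += 1
    return math.exp(-log_likelihood / n_tokens)
```

Plotting this value for each (alpha, beta) setting gives the requested comparison; lower perplexity indicates a better fit to the test data.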

Step 5 -- OPTIONAL!

Deadline: End of the course

Change the basic LDA according to your favourite paper on the topic.