Similar topics models are developed at Berkeley and in Finland (Mark at OSTP).

Our topics model (also called the Latent Dirichlet Allocation model) does not use EM. Instead, it uses Markov Chain Monte Carlo simulation with Gibbs sampling (see our paper for what that means).

This is a Gibbs sampler for the Latent Dirichlet Allocation generative model. It requires the Mersenne Twister random number generator code in cokus.c. To build the code on a Linux machine, put all files in a single directory and type "make bars".

INPUT: Topics works over a term-document matrix that records how often each term in a set of documents appears in each document of your data set. Most cell entries are zero, so a sparse summary of the matrix is used to represent it. When you determine the list of unique terms in a document set (this term set is also called a lexicon), be sure to delete all function words ("the", "a", "is", etc.). They will mess up the topics.

OPERATION AND OUTPUT: The algorithm performs a "burn-in" of a length set by the user,
intended to allow the Markov Chain to reach its stationary distribution.
Samples are then generated with a lag between them to reduce correlation.
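
The burn-in and lag scheme can be sketched in a few lines of Python. This is only an illustration, assuming the standard collapsed Gibbs update for LDA; it is not the C code that ships with cokus.c. The function name gibbs_lda, the exact schedule (BURNIN sweeps before the first sample, LAG sweeps before each later one), and the toy parameter values are assumptions made for the example.

```python
import random


def gibbs_lda(triples, W, D, K, alpha, beta, burnin, lag, n_samples, seed=0):
    """Collapsed Gibbs sampler for LDA over (word, doc, count) triples.

    Word and document indices are 1-based, as in the input format.
    Returns a list of W x K word-by-topic count matrices, one per sample.
    """
    rng = random.Random(seed)  # Python's generator is also a Mersenne Twister
    # Expand the triples into one (word, doc) entry per word token.
    tokens = [(w - 1, d - 1) for (w, d, c) in triples for _ in range(c)]
    wp = [[0] * K for _ in range(W)]  # word-by-topic counts
    dp = [[0] * K for _ in range(D)]  # document-by-topic counts
    ztot = [0] * K                    # number of tokens in each topic
    z = []                            # current topic of each token
    for (w, d) in tokens:             # random initial assignment
        k = rng.randrange(K)
        z.append(k)
        wp[w][k] += 1
        dp[d][k] += 1
        ztot[k] += 1

    samples = []
    for s in range(n_samples):
        # Assumed schedule: burn-in before sample 0, then a lag between samples.
        sweeps = burnin if s == 0 else lag
        for _ in range(sweeps):
            for i, (w, d) in enumerate(tokens):
                k = z[i]
                # Remove the token from the counts before resampling it.
                wp[w][k] -= 1
                dp[d][k] -= 1
                ztot[k] -= 1
                # P(z = j | rest) is proportional to
                # (wp[w][j] + beta) / (ztot[j] + W * beta) * (dp[d][j] + alpha)
                weights = [
                    (wp[w][j] + beta) / (ztot[j] + W * beta) * (dp[d][j] + alpha)
                    for j in range(K)
                ]
                k = rng.choices(range(K), weights=weights)[0]
                z[i] = k
                wp[w][k] += 1
                dp[d][k] += 1
                ztot[k] += 1
        samples.append([row[:] for row in wp])  # copy the current counts
    return samples


# Toy data: 4 words, 3 documents; all values here are made up for illustration.
toy = [(1, 1, 3), (2, 1, 2), (3, 2, 4), (4, 2, 1), (1, 3, 1), (3, 3, 2)]
toy_samples = gibbs_lda(toy, W=4, D=3, K=2, alpha=0.5, beta=0.1,
                        burnin=20, lag=5, n_samples=3)
print(len(toy_samples), "samples saved")
```

Each saved matrix always sums to the total number of word tokens, because every token is assigned to exactly one topic at any time.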
Once each sample is generated, summary information is saved out to a (potentially
large) file sample{n}.dat, where {n} is the number of the sample, starting
from 0. The code is currently written to save out the matrix of counts
of the number of times a particular word occurs in a particular topic.
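
Because a word-by-topic count matrix of this kind is mostly zeros, it can also be held as (row, column, count) triples. The following is a minimal sketch of that conversion, not part of the distributed code; the helper names dense_to_triples and triples_to_dense are hypothetical, and indices are 1-based to match the input format.

```python
def dense_to_triples(counts):
    """Return 1-based (row, col, count) triples for the nonzero cells.

    Cells are visited column by column, so the triples come out in
    column-major order.
    """
    n_rows = len(counts)
    n_cols = len(counts[0]) if counts else 0
    return [
        (r + 1, c + 1, counts[r][c])
        for c in range(n_cols)
        for r in range(n_rows)
        if counts[r][c] != 0
    ]


def triples_to_dense(triples, n_rows, n_cols):
    """Rebuild the dense matrix from 1-based (row, col, count) triples."""
    dense = [[0] * n_cols for _ in range(n_rows)]
    for (r, c, v) in triples:
        dense[r - 1][c - 1] = v
    return dense


# Small made-up example: 3 words (rows) by 2 topics (columns).
example = [[0, 2], [1, 0], [0, 0]]
example_triples = dense_to_triples(example)
print(example_triples)
```

Only the nonzero cells are kept, so the size of the result grows with the number of distinct word-topic pairs rather than with rows times columns.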
For example, for the bars data, sample0.dat will contain 10 columns and 25 rows, with entries representing the number of times a particular word (row) was assigned to a particular topic (column). Plotting the results with the included Matlab code baranalysis.m will show the topics extracted by the model.

Notes: This code has not been optimized, and nobody has yet had time to write it in a fashion that scales naturally. This means that there is no dynamic memory allocation, nor any of the other things that would make the code easy to apply to new data sets. In particular, the output files tend to be large because they contain a great many zeros. The output format used here is mainly for illustrative purposes, and any output generated in this fashion should be converted into sparse matrices using Matlab or some other program for storage. A more efficient way of producing the same information is to write out the files in the same [words documents counts] format that is used for the input data, which can also be converted into a sparse matrix by Matlab.

The estimation procedure: Look over the code to see if anything looks mysterious. There are two free parameters in the model that determine smoothing (ALPHA and BETA). The current values should work on most datasets, so there is not much to change in the code besides providing the right values for W, D, NMAX, and KMAX (the number of topics, the most important variable).

The output: Remember that the ldacora.cpp program is a Gibbs sampler. You do not get a single estimate for the parameters (as in maximum likelihood) but a stream of samples that starts after BURNIN iterations, with LAG iterations in between. For most of our purposes, each single sample is already quite a good estimate and no integration over samples is needed. The program saves sparse matrices of how many word tokens were assigned to each topic (the wp matrix saved in the wp*.txt files). Another file, ll*.txt, saves the log-likelihood values at each iteration.
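
As a rough sketch of how a single saved wp sample can be read as topics, the counts can be smoothed with BETA and normalized per topic, and the most probable words listed. This assumes the usual LDA estimate (count + BETA) / (topic total + W * BETA); the function names, the lexicon, and the numbers below are hypothetical and do not come from the distributed Matlab code.

```python
def topic_word_probs(wp, beta):
    """Turn a W x K count matrix into smoothed P(word | topic) columns."""
    W = len(wp)
    K = len(wp[0])
    totals = [sum(wp[w][k] for w in range(W)) for k in range(K)]
    return [
        [(wp[w][k] + beta) / (totals[k] + W * beta) for k in range(K)]
        for w in range(W)
    ]


def top_words(wp, lexicon, beta, n=3):
    """List the n most probable words for each topic (column of wp)."""
    phi = topic_word_probs(wp, beta)
    K = len(wp[0])
    topics = []
    for k in range(K):
        # Sort word indices by descending probability under topic k.
        order = sorted(range(len(lexicon)), key=lambda w: phi[w][k], reverse=True)
        topics.append([lexicon[w] for w in order[:n]])
    return topics


# Made-up counts for 4 words and 2 topics.
demo_wp = [[5, 0], [3, 0], [0, 4], [0, 1]]
demo_lexicon = ["gene", "protein", "star", "galaxy"]
for k, words in enumerate(top_words(demo_wp, demo_lexicon, beta=0.01, n=2)):
    print("topic", k, ":", " ".join(words))
```

Because each column is normalized separately, the probabilities within a topic sum to one whatever the number of tokens assigned to it.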
Ignore the dp files. Note that the displayed output also shows the log-likelihood values. These values are based on Bayesian model selection and will automatically penalize solutions with a large number of topics (see our PNAS paper).

Interpreting Topics: Once the estimation procedure has saved the first sample (after BURNIN iterations),
you can use the Matlab program to save a nicely formatted text file of
the topics (example provided in topics.txt). Note that this

The current Topics code requires an input file that has a header ....
followed by three columns. The first column gives the word indices (ranging
from 1 to the total number of unique words), the second column the document
indices (ranging from 1 to total #documents), and the third column gives
the frequency of that word in that document.

Topics takes a file formatted as follows:

Keywords Documents (describing what the two entities are)

Data input is a matrix of terms vs. documents (e.g., words vs. abstracts).
An input file consists of three lines:

Note: This representation is the same as that obtained when applying the find command to a sparse matrix in Matlab, and facilitates converting data back and forth.

Topics returns a hashmap which contains two matrix models: one containing counts of keywords by topic and the other containing counts of documents by topic.
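
The three-column input described above can be built from raw text along the following lines. This is a sketch under stated assumptions, not the tool's own preprocessing: the function name build_triples, the whitespace tokenizer, and the tiny function-word list are hypothetical, and the file header is left out because its contents are not spelled out here.

```python
from collections import Counter

# Hypothetical, deliberately tiny list of function words to drop.
FUNCTION_WORDS = {"the", "a", "is", "of", "and"}


def build_triples(documents, stop_words=FUNCTION_WORDS):
    """Return (lexicon, triples) for a list of document strings.

    Each triple is (word index, document index, frequency), with both
    indices starting at 1.
    """
    tokenized = [
        [t for t in doc.lower().split() if t not in stop_words]
        for doc in documents
    ]
    # The lexicon is the sorted set of unique remaining terms.
    lexicon = sorted({t for tokens in tokenized for t in tokens})
    index = {term: i + 1 for i, term in enumerate(lexicon)}
    triples = []
    for d, tokens in enumerate(tokenized, start=1):
        counts = Counter(tokens)
        for term in sorted(counts, key=index.get):
            triples.append((index[term], d, counts[term]))
    return lexicon, triples


docs = ["the cat is a cat", "a dog chased the cat"]
lexicon, triples = build_triples(docs)
for w, d, c in triples:
    print(w, d, c)
```

Only word-document pairs that actually occur produce a line, which is what keeps the representation sparse.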
The original Topics code (SVDPACK) was provided by Tom Griffiths and Mark Steyvers. Ben Ashpole translated the code into Java. Sriram Raghuraman integrated the code into the repository.