Description | Pros and Cons | Applications | Details | Usage Hints | References | Acknowledgments
Latent Semantic Analysis (LSA) can be applied to induce and represent aspects of the meaning of words (Berry et al., 1995; Deerwester et al., 1990; Landauer & Dumais, 1997; Landauer et al., 1998). LSA is a variant of the vector space model that converts a representative sample of documents to a term-by-document matrix in which each cell indicates the frequency with which each term (rows) occurs in each document (columns). Thus a document becomes a column vector and can be compared with a user's query represented as a vector of the same dimension. In the case of digital libraries, terms may stand for words in abstracts or titles, author names, or cited references. Only author names, titles, abstracts, and cited references occurring in more than one document are included in the analysis. LSA extends the vector space model by modeling term-document relationships using a reduced approximation for the column and row space, computed by the singular value decomposition of the term-by-document matrix. This is explained in detail subsequently.
SVD constructs an n-dimensional abstract semantic space in which each original term and each original (and any new) document is represented as a vector. In SVD, a rectangular term-by-document matrix X is decomposed into the product of three other matrices W, S, and P' with {X} = {W} {S} {P}' (see figure below). W is an orthonormal matrix whose rows correspond to the rows of X; S is a diagonal matrix containing the singular values in decreasing order; and P' is an orthonormal matrix whose columns correspond to the columns of X. Keeping only the n largest singular values, together with the corresponding columns of W and P, yields a reduced dimensionality solution that generates a vector of n real values to represent each document. The reduced matrix ideally represents the important and reliable patterns underlying the data in X. It corresponds to a least-squares best approximation to the original matrix X. Document similarities are then computed in three steps:
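The decomposition and truncation can be sketched with a small, entirely hypothetical term-by-document matrix, in Python rather than the Java/C used by the package (numpy returns the singular values as a vector rather than the diagonal matrix S):

```python
import numpy as np

# Toy term-by-document matrix X (5 terms x 4 documents); the counts are
# hypothetical, purely for illustration.
X = np.array([
    [2., 0., 1., 0.],
    [1., 1., 0., 0.],
    [0., 2., 0., 1.],
    [0., 0., 1., 2.],
    [1., 0., 2., 0.],
])

# Full decomposition X = W S P' (numpy returns the singular values s as a
# vector, sorted in decreasing order, rather than the diagonal matrix S).
W, s, Pt = np.linalg.svd(X, full_matrices=False)

# Keep only the n largest singular values for a rank-n approximation.
n = 2
X_n = W[:, :n] @ np.diag(s[:n]) @ Pt[:n, :]

# Each document is now represented by the n real values in its column of
# diag(s[:n]) @ Pt[:n, :]; X_n is the least-squares best rank-n
# approximation to X (Eckart-Young theorem).
doc_vectors = (np.diag(s[:n]) @ Pt[:n, :]).T
print(doc_vectors.shape)  # (4, 2): one n-dimensional vector per document
```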
- Matrix multiply W (its rows correspond to the terms (rows) of X) by the reduced matrix S.
- Normalize resulting vectors, i.e., divide each vector by its length. This way all vectors are mapped onto a hypersphere.
- Determine the document-document matrix of inner products that defines the similarities between all documents.
The dot product between two row vectors of the reduced matrix reflects the degree to which two documents have a similar pattern of terms. The matrix XX' is the square symmetric matrix containing all these document-document dot products. Since S is diagonal and W is orthonormal, the following holds: XX' = WS²W'. Thus the i, j cell of XX' can be obtained by taking the dot product between rows i and j of the matrix WS. That is, if one considers the rows of WS as coordinates for documents, dot products between these points give the comparison between documents.
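The three steps can be sketched as follows, assuming a hypothetical reduced LSA output (document coordinates and singular values for n = 2 retained dimensions; values are made up for illustration):

```python
import numpy as np

# Hypothetical reduced LSA output for 4 documents and n = 2 retained
# dimensions: one row of coordinates per document, plus the singular values.
P = np.array([
    [0.8, 0.1],
    [0.7, 0.3],
    [0.1, 0.9],
    [0.2, 0.8],
])
s = np.array([3.0, 2.0])

# Step 1: matrix multiply the coordinates by the (diagonal) reduced matrix S.
coords = P * s                  # row i = coordinates of document i

# Step 2: normalize each vector to unit length (map onto the hypersphere).
coords = coords / np.linalg.norm(coords, axis=1, keepdims=True)

# Step 3: the matrix of inner products holds all document-document similarities.
sim = coords @ coords.T

print(np.round(sim, 2))
```

Because every row is normalized, the diagonal of the resulting matrix is exactly 1, matching the note about the similarity matrix later in the text.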
Merging of document clusters continues until a single, all-inclusive cluster remains. At termination, a uniform, binary hierarchy of document clusters is produced. The figure below shows an example of how five documents may be clustered hierarchically. The cluster hierarchy is shown on the left-hand side; corresponding partitions are given on the right-hand side. Note that each document has a similarity of one to itself. (Figure: Hierarchical Clustering)
To select one partition for display, a utility measure contrasts the sum of within-cluster similarities wSim with the sum of between-cluster similarities bSim: utility = wSim / (wSim + bSim). The partition that shows the highest utility is selected for data visualization.
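A minimal sketch of the merge-and-select procedure, in Python rather than the package's Java, assuming a precomputed similarity matrix with ones on the diagonal ('SimParser.java' may differ in detail). Note that the all-inclusive partition trivially reaches utility 1 because bSim = 0, so the sketch selects among the non-trivial partitions:

```python
# Complete-linkage agglomerative clustering over a document-document
# similarity matrix (plain nested lists), plus the utility measure
# utility = wSim / (wSim + bSim) used to pick a partition.

def complete_linkage_partitions(sim):
    """Merge the two most similar clusters until one remains; the similarity
    of two clusters is the minimum pairwise document similarity (complete
    linkage). Returns the partition after every merge, singletons first."""
    clusters = [[i] for i in range(len(sim))]
    partitions = [[c[:] for c in clusters]]
    while len(clusters) > 1:
        best_link, pair = float("-inf"), None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                link = min(sim[i][j] for i in clusters[a] for j in clusters[b])
                if link > best_link:
                    best_link, pair = link, (a, b)
        a, b = pair
        clusters[a].extend(clusters[b])
        del clusters[b]
        partitions.append([c[:] for c in clusters])
    return partitions

def utility(partition, sim):
    """Contrast within-cluster and between-cluster similarity sums."""
    label = {i: k for k, c in enumerate(partition) for i in c}
    w = b = 0.0
    n = len(label)
    for i in range(n):
        for j in range(i + 1, n):
            if label[i] == label[j]:
                w += sim[i][j]
            else:
                b += sim[i][j]
    return w / (w + b) if (w + b) else 0.0

# Hypothetical similarity matrix for four documents (two obvious groups).
sim = [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.2, 0.1],
    [0.2, 0.1, 0.8, 1.0],
][:0] or [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
]

parts = complete_linkage_partitions(sim)
# The all-inclusive partition has bSim = 0 and hence utility 1, so select
# among the non-trivial partitions only.
best = max(parts[:-1], key=lambda p: utility(p, sim))
print(best)  # [[0, 1], [2, 3]]
```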
The advantage of SVD lies in the fact that documents are not represented by individual terms but by a smaller number of independent "artificial values" that can be specified by any one of several terms or combinations thereof. In this way, relevant documents that do not contain the terms of the query are retrieved via other terms in the query that can be properly identified. The resulting retrieval scheme allows one to order documents continuously by similarity to a query. A similarity threshold or a number of solutions can be specified depending on the user and task. In such a way, SVD overcomes two fundamental problems faced by traditional lexical matching indexing schemes: synonymy (the same concept can be expressed by many different terms) and polysemy (the same term can refer to different concepts).
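Retrieval by similarity to a query can be illustrated with the standard LSA "folding-in" projection q̂ = q'WS⁻¹ on a hypothetical toy matrix (this sketch is not part of the package described below):

```python
import numpy as np

# Hypothetical toy term-by-document matrix (4 terms x 4 documents);
# documents 0-1 use terms 0-1, documents 2-3 use terms 2-3.
X = np.array([
    [1., 1., 0., 0.],
    [1., 2., 0., 0.],
    [0., 0., 1., 1.],
    [0., 0., 1., 1.],
])

W, s, Pt = np.linalg.svd(X, full_matrices=False)
n = 2
Wn, sn, Pn = W[:, :n], s[:n], Pt[:n, :].T   # Pn: one row per document

# Fold a query (a term-frequency vector) into the same reduced space using
# the standard LSA projection q_hat = q' W S^-1.
q = np.array([1., 1., 0., 0.])              # query mentions terms 0 and 1
q_hat = (q @ Wn) / sn

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Rank all documents by cosine similarity to the folded-in query.
scores = [cosine(q_hat, d) for d in Pn]
ranking = sorted(range(len(scores)), key=lambda i: -scores[i])
print(ranking[:2])  # documents 0 and 1 come out on top
```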
- LVis Digital Library Visualizer (Börner et al., 2000)
The SVDPACK by M. Berry (1993) allows one to decompose matrices on the order of 50,000 by 50,000 (e.g., 50,000 words and 50,000 contexts) into representations in 300 dimensions with about two to four hours of computation on a high-end Unix workstation with around 100 megabytes of RAM. The computational complexity is O(3Dz), where z is the number of non-zero elements in the word (W) x context (C) matrix and D is the number of dimensions returned. The maximum matrix size one can compute is usually limited by the memory (RAM) requirement, which for the fastest of the methods in the Berry package is (10+D+q)N + (4+q)q, where N = W + C and q = min(N, 600), plus space for the W x C matrix.
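Plugging the quoted numbers into the memory formula gives a concrete estimate (the result is in floating-point words; multiply by the word size in bytes for an absolute figure):

```python
# Memory estimate for the fastest SVDPACK method, using the formula quoted
# above: (10 + D + q) * N + (4 + q) * q words, with N = W + C and
# q = min(N, 600); the sparse W x C matrix itself needs extra space.
def svdpack_memory_words(W, C, D):
    N = W + C
    q = min(N, 600)
    return (10 + D + q) * N + (4 + q) * q

# The 50,000-word by 50,000-context example with D = 300 dimensions:
print(svdpack_memory_words(50_000, 50_000, 300))  # 91362400 words
```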
From the SVDPACK, the sparse SVD via the single-vector Lanczos algorithm will be used because of its comparatively low memory and processing requirements and low to moderate accuracy. The document-by-term input matrix has to be represented in a compressed (Harwell-Boeing) storage format. The package and all Java files can be found on ella in '/home/www/ella/htdocs/classes/L697/code/lsa'. There are four separate steps involved in determining and visualizing the semantic relationships between documents:
1. Parse documents:
- Use 'DocumentParser.java', 'DocumentNW.java', 'Document.java', and 'MatrixSort.java' by Katy Börner to read in representative samples of documents like slis-courses.dat, to determine the document-term frequency matrix (cell entries are term frequencies in the documents), and to write the result out in compressed Harwell-Boeing storage format, like 'slis-courses.hbf'.
- Just copy the four files and input *.dat files, compile with 'javac DocumentParser.java', and run with 'java DocumentParser'. Change input/output file names in 'DocumentParser.java' if needed.
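What 'DocumentParser.java' computes can be sketched in Python (a hypothetical mini-corpus and naive whitespace tokenization; the real parser is more elaborate and also writes the Harwell-Boeing file):

```python
from collections import Counter

# Hypothetical mini-corpus standing in for slis-courses.dat.
docs = [
    "information retrieval and digital libraries",
    "latent semantic analysis of digital libraries",
    "visualization of information spaces",
]
counts = [Counter(d.split()) for d in docs]

# As described above, keep only terms occurring in more than one document.
vocab = sorted(t for t in set().union(*counts)
               if sum(t in c for c in counts) > 1)

# Document-term frequency matrix: one row per term, one column per document.
matrix = [[c[t] for c in counts] for t in vocab]

for t, row in zip(vocab, matrix):
    print(f"{t:12s}{row}")
```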
2. Latent Semantic Analysis:
- Copy all files in ../lsa/svdpack into your svdpack directory.
Code     Input          Output
------   ------------   ---------
las2.c   lap2, matrix   lao2, lav2, lav2-data

- Compiling with 'gcc las2.c timersun.o -lm' creates a.out.
- The program expects a file named 'matrix' in Harwell-Boeing storage format. Just rename 'slis-courses.hbf' into 'matrix'.
- Run with 'a.out'. The result (right-singular vectors, left-singular vectors, and singular values) is stored in 'lav2-data'. (If the current path is not set in your shell script properly, start C programs with './a.out'.)
- You can influence how many singular values are generated by manipulating the second number in the parameter file lap2. The file contains a single line:

  <name> lanmax maxprs endl endr vectors kappa

  where <name> is the name of the data set; lanmax is an integer specifying the maximum number of Lanczos steps allowed; maxprs indicates the maximum number of singular triplets of A (eigenpairs of the equivalent matrix B) desired; endl, endr are the two end-points of an interval within which all unwanted eigenvalues of the particular matrix B lie; vectors contains the string TRUE or FALSE to indicate whether singular triplets are needed (TRUE) or only singular values are needed (FALSE); kappa contains the relative accuracy of Ritz values acceptable as eigenvalues of the matrix B.
- Copy the right-singular vectors and singular values into a '*.lsa' file that has the following format:
RSV 2 7
SV 2

- Typically, you will have many more dimensions but never more than in your input matrix :)
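For concreteness, a hypothetical lap2 line requesting maxprs = 2 singular triplets for the slis-courses data set might look as follows (the lanmax, endl/endr, and kappa values here are illustrative, not taken verbatim from the SVDPACK distribution):

```
slis-courses 600 2 -1.0e-30 1.0e-30 TRUE 1.0e-6
```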
3. Use LSA output to determine the document-by-document similarity matrix:
- Use 'SimParser.java' and 'SimParserMain.java' by Katy Börner to read in the right singular vectors as well as the singular values (slis-courses.lsa) as computed by LSA, to matrix multiply W by the reduced matrix S, to normalize the resulting vectors, and to determine the document-document matrix of inner products that defines the similarities between all documents.
- Just copy the two files and the sample input *.dat file, compile with 'javac SimParserMain.java', and run with 'java SimParserMain > out.txt'. Change the input file name in 'SimParserMain.java' if needed.
- The resulting inner product matrix corresponds to the matrix in 'out.txt' after the line 'Inner Product:'. For the slis-courses data it looks like slis-courses.sim. Note that all diagonal values are 1.
- Note that 'SimParser.java' contains a hierarchical clustering method based on the inner product matrix (complete-linkage method) that can be used to quickly compute a cluster hierarchy.
- Berry, M. et al. SVDPACKC (Version 1.0) User's Guide, University of Tennessee Tech. Report CS-93-194, 1993 (Revised October 1996). See also http://www.netlib.org/svdpack/index.html.
- Berry, M., Dumais, S.T., & O'Brien, G.W. (1995). Using linear algebra for intelligent information retrieval. SIAM Review, 37(4), 573-595. See also http://www.cs.utk.edu/~berry/
- Börner, K., Dillon, A. & Dolinsky, M. (2000) LVis - Digital Library Visualizer. Information Visualisation 2000, Symposium on Digital Libraries, London, England, 19 -21 July, pp. 77-81. (http://ella.slis.indiana.edu/~katy/IV-LVis/)
- Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. (1990). Indexing By Latent Semantic Analysis. Journal of the American Society For Information Science, 41, 391-407. See also http://lsa.colorado.edu/
- Landauer, T.K., & Dumais, S.T. (1997). A solution to Plato's problem: The latent semantic analysis theory of the acquisition, induction, and representation of knowledge. Psychological Review, 104, 211-240. See also http://lsa.colorado.edu/
- Landauer, T. K., Foltz, P. W., & Laham, D. (1998). Introduction to Latent Semantic Analysis. Discourse Processes, 25, 259-284. See also http://lsa.colorado.edu/
- Landauer, T. K., Laham, D. & Foltz, P. W. (1998). Learning human-like knowledge by Singular Value Decomposition: A progress report. In M. I. Jordan, M. J. Kearns & S. A. Solla (Eds.), Advances in Neural Information Processing Systems 10, pp. 45-51. Cambridge: MIT Press. See also http://lsa.colorado.edu/
- Rehder, B., Schreiner, M. E., Wolfe, M. B., Laham, D., Landauer, T. K., & Kintsch, W. (1998). Using Latent Semantic Analysis to assess knowledge: Some technical considerations. Discourse Processes, 25, 337-354. See also http://lsa.colorado.edu/
- Kintsch, W. (in press). Predication. Cognitive Science. See also http://lsa.colorado.edu/
- Kintsch, W. (in press). Metaphor comprehension: A computational theory. Psychonomic Bulletin and Review. See also http://lsa.colorado.edu/
The original LSA code (SVDPACK) was provided by Michael Berry (berry@cs.utk.edu). Katy Börner implemented the parsers as well as the code to compute the document-by-document and term-by-term matrices. Sriram Raghuraman integrated the code into the XML toolkit.