Latent semantic indexing
Most retrieval systems match words in users' queries with words in the text of documents in the database. While such systems are popular, they are far from perfect, as anyone who has used an online library catalog or Web search engine can attest. One aspect of the problem is lack of precision: on average 50% of the information retrieved will be irrelevant, and this is quite evident to users. Another problem is recall failure: a search often retrieves as little as 20% of the available relevant information. This problem is much harder to grasp, since you don't know what you are missing! Yet it is very important for searchers and information providers alike.
The main reason for missing relevant information is that there are surprisingly many different ways to describe the same idea or concept. If a document author uses one word and a searcher another, relevant materials will be missed. A query about "laptop" computers, for example, will fail to find articles about "portable" or "lightweight" or "notebook" or "palmtop" or "ThinkPad" computers. Searchers and authors alike find it very difficult to anticipate the many ways in which the same idea might be described. By automatically constructing a semantic or concept space, LSI enables users to find relevant information even when it shares no words with their queries. It requires no additional work either by the searcher to painstakingly describe their needs or by the content provider to carefully handcraft a thesaurus or knowledge base.
LSI uses a powerful and fully automatic statistical method (singular value decomposition) to uncover the associations among terms in a large collection of texts, to create a semantic or concept space, and to exploit this space to improve retrieval. LSI has been reported to be about 30% more effective than popular word-matching methods in helping users find relevant information (e.g., Deerwester et al., 1990; Dumais, 1995). Roughly speaking, by analysis of a collection of texts, LSI will learn that "laptop" and "portable" occur in many of the same contexts, and that queries about one should probably retrieve documents about the other. Unlike hand-crafted knowledge bases or thesauri, LSI is completely automatic and widely applicable. It can handle multimedia descriptions, marketing brochures, trouble reports, email messages, or World Wide Web URLs with equal ease. In addition to its overall retrieval benefits, LSI is particularly applicable to improving information access when:
1. high recall is necessary (e.g., matching new problems against a database of existing trouble reports and solutions, data mining efforts, law, medicine, research);
2. text descriptions are short (e.g., figure captions, multimedia information, ads);
3. user input or texts are noisy (e.g., pen or OCR input); and
4. there is a need to retrieve information in multiple languages without requiring translation of queries or documents.
LSI can be used in all these applications with no modifications to the existing algorithms. It can be used both to answer specific information requests and to monitor new information for more stable user interests. Because LSI can retrieve relevant information that does not contain query words, it finds more relevant information than other methods. Similarly, because it does not rely on literal matching, it can be used when the available textual information or user queries are short or noisy. It is also the only known method for cross-language retrieval that does not require translation of user queries or information: using LSI, queries in one language can effectively retrieve information in the same or different languages.
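The core idea can be sketched on a toy collection. The following is a minimal illustration, not the implementation used in the cited studies: the five documents, the use of raw term counts without weighting, the choice of k = 2 dimensions, and the helper names (term_vector, fold_in, cosine) are all assumptions made for this example. It builds a term-document matrix, truncates its singular value decomposition, and compares a query with each document in the reduced space.

```python
import numpy as np

# Toy collection (hypothetical): three computer documents, two cooking documents.
docs = [
    "laptop computer battery screen",
    "portable computer battery screen",   # never mentions "laptop"
    "laptop portable computer",
    "cooking recipe pasta sauce",
    "pasta sauce tomato recipe",
]

# Vocabulary and term -> row index mapping.
vocab = sorted({w for d in docs for w in d.split()})
index = {w: i for i, w in enumerate(vocab)}

def term_vector(text):
    """Raw term-count vector over the vocabulary (unknown words are ignored)."""
    v = np.zeros(len(vocab))
    for w in text.split():
        if w in index:
            v[index[w]] += 1.0
    return v

# Term-document matrix: one row per term, one column per document.
A = np.column_stack([term_vector(d) for d in docs])

# Singular value decomposition, truncated to the k largest singular values.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
U_k, s_k = U[:, :k], s[:k]
doc_coords = Vt[:k, :].T          # row j = document j in the k-dimensional space

def fold_in(text):
    """Project a query into the same space as the documents: q^T U_k S_k^{-1}."""
    return term_vector(text) @ U_k / s_k

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

query = "laptop"
literal_hits = [j for j, d in enumerate(docs) if query in d.split()]
lsi_scores = [cosine(fold_in(query), doc_coords[j]) for j in range(len(docs))]

print("literal matches:", literal_hits)
for j, score in enumerate(lsi_scores):
    print(f"doc {j}: {score:+.2f}  {docs[j]}")
```

In this toy collection the second document is missed by literal matching because it never mentions "laptop", yet it scores close to 1 in the reduced space because "laptop" and "portable" occur in the same contexts; the two cooking documents score near 0. Real collections involve far larger matrices, and term weighting and the number of dimensions retained would need to be chosen for the data at hand.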
Deerwester, S., Dumais, S. T., Landauer, T. K., Furnas, G. W. and Harshman, R. A. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6), 391-407.
Dumais, S. T., Landauer, T. K. and Littman, M. L. (1996). Automatic cross-linguistic information retrieval using latent semantic indexing. In Proceedings of the ACM SIGIR '96 Workshop on Cross-Linguistic Information Retrieval, August 1996.