Scattered notes and pointers on Latent Semantic Analysis / Indexing, patchwork style.
Origin
Using Latent Semantic Analysis To Improve Access To Textual Information (1988) Susan T. Dumais, George W. Furnas, Thomas K. Landauer, Scott Deerwester, Richard Harshman
Proceedings of the Conference on Human Factors in Computing Systems CHI’88
Patent
Computer information retrieval using latent semantic structure
United States Patent: 4839853 – Deerwester, et al. – June 13, 1989
Filed: September 15, 1988 (USPTO)
Latent Semantic Indexing (LSI) is a novel, patented information retrieval method developed at Telcordia Technologies, Inc. (Telcordia)
Telcordia Technologies, formerly Bell Communications Research or Bellcore, is an American telecommunications R&D (Research & Development) company (Wikipedia)
Description and fundamental principles
Because of the tremendous diversity in the words people use to describe the same object, lexical matching methods are necessarily incomplete and imprecise.
The latent semantic indexing approach tries to overcome these problems by automatically organizing text objects into a semantic structure more appropriate for matching user requests. (Using Latent Semantic Analysis To Improve Access To Textual Information)
Roughly speaking, by analysis of a collection of texts, LSI will learn that “laptop” and “portable” occur in many of the same contexts, and that queries about one should probably retrieve documents about the other. Unlike hand-crafted knowledge bases or thesauri, LSI is completely automatic and widely applicable. (Telcordia)
We assume there is some underlying “latent” semantic structure in word usage data that is partially obscured by the variability of word choice. We use statistical techniques to estimate this latent structure and get rid of the obscuring “noise”. A description of terms, objects and user queries based on the underlying latent semantic structure (rather than surface level word choice) is used for representing and retrieving information. (Using Latent Semantic Analysis To Improve Access To Textual Information)
The approach is to take advantage of implicit higher-order structure in the association of terms with documents (“semantic structure”) in order to improve the detection of relevant documents on the basis of terms found in queries. (Indexing by Latent Semantic Analysis)
In LSI, search works by concepts. A concept, however, is not the abstraction or generalization of a term (e.g. cardigan -> clothing) but a set of related terms (cardigan, sweater, dress), called co-occurrences or a semantic domain. (LSI – UniRoma)
Starting from a vector representation of the documents, in which each coordinate corresponds to a term, LSI tries to “project” the document vectors into a lower-dimensional “latent semantic subspace” whose coordinates are the “concepts”.
Intuitively, a concept can be seen as a set of terms that (frequently) occur together in the same documents. In effect, LSI performs a clustering of the terms (and of the documents). (LSI – UniBologna)
The development of LSA was motivated by the inability of the vector-space model to handle synonymy – a query about “boats” will not retrieve documents about “ships” in the standard vector-space model. LSA solves this problem by reducing the original high-dimensional vector space into a much smaller space (but still relatively large; usually a few hundred dimensions), in which the original dimensions that represented words and documents have been condensed into a smaller set of “latent” dimensions that collapses words and documents with similar context vectors. This alleviates the problem with synonymy when retrieval is performed in the reduced space.
The dimensionality reduction is accomplished by using a statistical dimensionality reduction technique called Singular Value Decomposition (SVD). (The Word-Space Model)
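As a minimal sketch of the idea in the quotes above, the following NumPy example builds a tiny term-document matrix, truncates its SVD to k = 2 dimensions and compares similarities before and after the reduction. The vocabulary, the six documents and the choice k = 2 are hypothetical, chosen only for illustration; real collections typically keep a few hundred dimensions, as noted above.

```python
import numpy as np

terms = ["boat", "ship", "sea", "harbor", "tree", "forest"]
# Hypothetical toy term-document matrix (rows = terms, columns = documents).
# "boat" and "ship" never occur in the same document, but share contexts.
A = np.array([
    [1, 0, 1, 0, 0, 0],  # boat
    [0, 1, 0, 1, 0, 0],  # ship
    [1, 1, 1, 0, 0, 0],  # sea
    [1, 1, 0, 1, 0, 0],  # harbor
    [0, 0, 0, 0, 1, 1],  # tree
    [0, 0, 0, 0, 1, 1],  # forest
], dtype=float)

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Full SVD, then keep only the k largest singular values (illustrative k).
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
U_k = U[:, :k]

term_vecs = U_k * s[:k]   # term coordinates in the latent space
doc_vecs = A.T @ U_k      # documents projected onto the same k axes

boat, ship, tree = (terms.index(t) for t in ("boat", "ship", "tree"))

raw_sim = cosine(A[boat], A[ship])                     # 0.0: no shared document
latent_sim = cosine(term_vecs[boat], term_vecs[ship])  # close to 1
latent_unrelated = cosine(term_vecs[boat], term_vecs[tree])  # close to 0

# A query containing only "boat", projected like a document.
q = np.zeros(len(terms))
q[boat] = 1.0
q_hat = q @ U_k

ship_doc = 3  # fourth document: contains "ship" and "harbor", not "boat"
query_raw = cosine(q, A[:, ship_doc])                  # 0.0 with lexical matching
query_latent = cosine(q_hat, doc_vecs[ship_doc])       # close to 1 in the latent space

print(raw_sim, latent_sim, latent_unrelated)
print(query_raw, query_latent)
```

In the raw space the "boat" query cannot match a document that only mentions "ship"; after the reduction both collapse onto the same latent dimension because they share the contexts "sea" and "harbor".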
The meaning of a word can be regarded as the average of the meanings of all the passages in which it appears, and the meaning of a passage as the average of the meanings of all the words it contains. (Un modello statistico del linguaggio basato sull’LSA)
LSI exploits the co-occurrence of terms in documents to uncover the semantic structure hidden (latent) in the association between terms and documents. LSI does not consider the semantics of the terms themselves. It starts from the SVD (Singular Value Decomposition) of the occurrence matrix (the term-document matrix) and then reduces its dimensionality, projecting documents, terms and queries into a subspace whose coordinates are concepts. In this way LSI addresses the synonymy problem (example: missing the documents that contain only “PC” when searching for “computer”; this loss lowers the ratio of relevant documents returned by the query to the total number of relevant documents, i.e. low recall). It may, however, struggle with the polysemy problem (example: pages containing “internet” are returned when searching for “surfing”; this error lowers the share of relevant pages among all the pages returned, i.e. low precision). (emmeesse)
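The two ratios mentioned above, recall (relevant documents returned over all relevant documents) and precision (relevant documents returned over all documents returned), can be sketched in a few lines of Python. The document identifiers and result sets below are made up purely to mirror the synonymy and polysemy examples.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = relevant documents returned / all documents returned
    recall    = relevant documents returned / all relevant documents
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Synonymy (hypothetical): d3 and d4 only say "PC", so a literal search
# for "computer" misses them and recall drops.
synonymy = precision_recall(retrieved=["d1", "d2"],
                            relevant=["d1", "d2", "d3", "d4"])

# Polysemy (hypothetical): a search for "surfing" also returns two
# internet pages, so precision drops.
polysemy = precision_recall(retrieved=["d1", "d2", "d5", "d6"],
                            relevant=["d1", "d2"])

print(synonymy)  # (1.0, 0.5)
print(polysemy)  # (0.5, 1.0)
```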
Problems
The problem is that the word-space methodology relies on statistical evidence to construct the word space – if there is not enough data, we will not have the required statistical foundation to build a model of word distributions.
At the same time, the co-occurrence matrix will become prohibitively large for any reasonably sized data, which seriously affects the scalability and efficiency of the algorithm.
This presents us with the following delicate dilemma: on the one hand, we need as much data as we can get our hands on in order to build a truthful model of language use; on the other hand, we want to use as little data as possible because our algorithms will become computationally prohibitive otherwise. (The Word-Space Model)
The poor scalability of the singular value decomposition (SVD) algorithm remains an obstacle to indexing very large collections. While techniques have been developed for making incremental updates to a scaled collection, these changes typically cannot exceed a certain threshold without triggering a rebuild. These constraints make LSI ill suited to the kinds of large, rapidly changing document collections typically found on the Web. (Semantic Search of Unstructured Data using Contextual)
A further disadvantage to LSI is the difficulty in interpreting the underlying reduced term space. This makes it difficult to select an optimum number of singular values to retain in the SVD for a given collection, or allow domain expert adjustment of relevance values in the reduced space once the SVD has been calculated. (Semantic Search of Unstructured Data using Contextual)
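The quoted sources do not prescribe a way to pick the number of singular values. One common heuristic, shown here only as an assumption and not as a method taken from those sources, is to keep the smallest k whose squared singular values account for a chosen fraction of the total. The function name `rank_for_energy` and the thresholds are hypothetical.

```python
import numpy as np

def rank_for_energy(singular_values, threshold=0.9):
    """Smallest k whose leading squared singular values reach `threshold`
    of the total squared sum (a heuristic, not a rule from the sources)."""
    s = np.asarray(singular_values, dtype=float)
    energy = np.cumsum(s ** 2)
    energy /= energy[-1]
    # First index at which the cumulative share reaches the threshold.
    return int(np.searchsorted(energy, threshold) + 1)

# Hypothetical singular values, sorted in decreasing order as SVD returns them.
s = [3.0, 2.0, 1.0, 0.5]
k_90 = rank_for_energy(s, 0.90)
k_95 = rank_for_energy(s, 0.95)
print(k_90, k_95)  # 2 3
```

A threshold of this kind says nothing about retrieval quality, which is the point of the quote: in practice k is usually tuned against relevance judgments for the collection at hand.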
In any case, LSA as currently applied has some additional limitations, such as ignoring word order (and therefore syntactic or logical relations) and morphology. (Un modello statistico del linguaggio basato sull’LSA)
Mathematically
Chapter 4.1, pp. 71 to 82 (Un approccio innovativo al content management basato su LSI e reti di similarità)
Latent Semantic Indexing – UniBologna
LSI and SDD (Semi-Discrete Matrix Decomposition)
SDD is a decomposition that can be used in place of SVD:
Kolda and O’Leary (1998) found that for equal query times, the SDD produced precision rates similar to SVD with only one-tenth of the storage. However the decomposition requires more time to decompose the original matrix, and requires a higher dimension (k) than SVD.
The Semi-Discrete Matrix Decomposition is similar to the SVD, in that the original is decomposed into three matrices: Ak = Xk Dk Yk*. However, the matrices Xk and Yk* use only entries from the set {-1, 0, 1} (Kolda, 1997). (IR using LSI and a SDD)
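To make the shape of the decomposition concrete, here is a small sketch that multiplies out Ak = Xk Dk Yk* for hand-picked factors. It does not compute an SDD: the matrices below are hypothetical and serve only to show that Xk and Yk hold nothing but -1, 0 and 1, while the real-valued weights live on the diagonal of Dk.

```python
import numpy as np

# Hypothetical SDD factors with k = 2 (not the result of an actual SDD run).
X = np.array([[1,  0],
              [1, -1],
              [0,  1]])            # m x k, entries in {-1, 0, 1}
D = np.diag([2.0, 0.5])            # k x k, positive diagonal weights
Y = np.array([[ 1, 0],
              [ 0, 1],
              [-1, 1],
              [ 1, 1]])            # n x k, entries in {-1, 0, 1}

# Rank-k approximation reconstructed from the three factors.
A_k = X @ D @ Y.T
print(A_k)

# Every entry of X and Y takes one of three values, so it fits in 2 bits,
# whereas SVD factors need a floating-point number per entry.
ternary = set(np.unique(X)) | set(np.unique(Y))
print(ternary <= {-1, 0, 1})
```

That compact encoding of the two outer factors is where the storage saving reported in the quote comes from.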
On the difference between Latent Semantic Indexing and Latent Semantic Analysis
It can be said that LSI is the analysis of latent semantics in which a specific technique, SVD, is used. (Demystifying LSA, LSI, SVD, PCA, and is acronyms)
The terms LSI and LSA have since come to be used more or less synonymously in the literature, but whereas “LSI” is used primarily in the context of information retrieval, “LSA” is used for the more general application of these ideas. (The Word-Space Model)
LSI and LSA are indeed used as synonyms; I prefer instead (in agreement with E. Garcia) to regard LSI as one particular LSA technique: the one that uses SVD.
On the difference between Latent Semantic Indexing and Information Space
The main difference between LSI and IS is that LSI utilizes a singular value decomposition (SVD) on the term by document matrix, while IS utilizes principal components analysis (PCA) on the term by term matrix. (Demystifying LSA, LSI, SVD, PCA, and is acronyms)
PCA
Principal component analysis (PCA) is a technique that, starting from a set of original variables X1 … Xm, produces a new set of variables X1* … Xp* (p<=m) where each Xj* is a linear combination of X1 … Xm.
The PCA model guarantees that the new variables X1* … Xp* are all mutually uncorrelated.
Not to be confused with Factor Analysis.
(Analisi dei dati con applicazioni informatiche)
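A minimal NumPy sketch of the PCA definition above: the new variables are linear combinations of the centered originals, and their covariance matrix comes out diagonal, i.e. they are uncorrelated. The random data, the seed and the choice m = 3, p = 2 are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 200 observations of m = 3 correlated variables.
base = rng.normal(size=(200, 2))
X = np.column_stack([base[:, 0],
                     base[:, 0] + 0.1 * base[:, 1],
                     base[:, 1]])

# Center the variables and compute their covariance matrix.
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)

# Eigenvectors of the covariance matrix give the linear combinations;
# eigh returns eigenvalues in ascending order, so sort them descending.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]

p = 2                              # number of new variables kept (p <= m)
W = eigvecs[:, order[:p]]          # m x p matrix of weights
Z = Xc @ W                         # new variables X1* ... Xp*

# The off-diagonal entries of the new covariance matrix are numerically zero.
cov_Z = np.cov(Z, rowvar=False)
print(np.round(cov_Z, 6))
```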
Other resources
