Archive for the ‘LSA’ Category

The Snowball algorithm and stemming

May 29, 2006

What is stemming?
During the text-preprocessing stage (I will write about this in more detail later) we can identify four phases (a short code sketch follows the list):

  1. Text linearization
  2. Text tokenization
  3. Word filtering
  4. Stemming
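
As a rough illustration, here is a minimal sketch of this pipeline in Python using NLTK; the helper name preprocess and the sample sentence (one of the Deerwester example titles) are mine, and it assumes NLTK plus its 'punkt' and 'stopwords' data packages are installed:

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

def preprocess(text):
    # 1. Linearization: collapse the raw text into a single plain string.
    linear = " ".join(text.split())
    # 2. Tokenization: split the string into individual word tokens.
    tokens = nltk.word_tokenize(linear.lower())
    # 3. Filtering: drop stop words and non-alphabetic tokens.
    stop = set(stopwords.words("english"))
    filtered = [t for t in tokens if t.isalpha() and t not in stop]
    # 4. Stemming: reduce each surviving word to its stem.
    stemmer = PorterStemmer()
    return [stemmer.stem(t) for t in filtered]

print(preprocess("Human machine interface for ABC computer applications"))
# -> ['human', 'machin', 'interfac', 'abc', 'comput', 'applic']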

It is this last phase that I want to talk about a bit in this entry. In one of the paragraphs of my upcoming Diplomarbeit (diploma thesis) I mention the following:

One of the problems in vector space models is the heterogeneity of the words. Removing suffixes by automatic means is an operation which is especially useful in this field. In order to keep all the information together, it is necessary to group those words with the same lexeme and meaning.

As we saw before, in a typical VSM environment one has a collection of documents, each described by the words it contains; that is, a document is represented by a vector of words, or terms. Terms with a common stem will normally have similar meanings, as we can see in the example below.

There are several algorithms for doing stemming:

  1. Paice/Husk Stemming Algorithm \cite{stemming:paice}
  2. Porter Stemming Algorithm \cite{stemming:porter}
  3. Lovins Stemming Algorithm \cite{Lovins68}
  4. Dawson Stemming Algorithm \cite{stemming:dawson}
  5. Krovetz Stemming Algorithm \cite{stemming:krovetz}

I have opted for the Porter algorithm, since it is the one used by default in the Lucene project. The Snowball algorithm is something like the reference implementation of Porter's stemming process.

A simple example to see what stemming is (taken from Porter's paper); the sketch after the table runs it through an actual stemmer:

Word –> Stem –> Suffix
CONNECT –> CONNECT –> (none)
CONNECTED –> CONNECT –> ED
CONNECTING –> CONNECT –> ING
CONNECTION –> CONNECT –> ION
CONNECTIONS –> CONNECT –> IONS
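
To double-check the table, a quick sketch with NLTK's stemmers; note (my own aside, not from the table) that Snowball's English stemmer is the lightly revised "Porter2", so it can occasionally differ from the original Porter algorithm, though not on these words:

from nltk.stem import PorterStemmer
from nltk.stem.snowball import SnowballStemmer

porter = PorterStemmer()               # original Porter algorithm
snowball = SnowballStemmer("english")  # Snowball's English stemmer ("Porter2")

for w in ["CONNECT", "CONNECTED", "CONNECTING", "CONNECTION", "CONNECTIONS"]:
    print(w, "->", porter.stem(w.lower()), "/", snowball.stem(w.lower()))
# All five words reduce to the stem 'connect' under both stemmers.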


Mathematical concepts behind LSA

May 4, 2006

This is a simple enumeration of the mathematical concepts behind Latent Semantic Analysis (the vector operations are sketched in code after the list):

  • Vectors
  • Commutative operations
  • Coordinates, bases, and dimensions
  • Formal definition of a vector space
  • Cartesian coordinates
  • Similarity and distance functions on vectors
  • The vector space R^n
  • Scalar (dot) product of two vectors
  • Euclidean distance between two vectors
  • Norm and unit vectors
  • Cosine similarity in R^n
  • Matrix diagonalization
  • Orthonormal vectors
  • Matrix multiplication
  • Diagonal matrix
  • Unitary matrix
  • Transpose matrix
  • Inverse matrix
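
To make the vector-related items above concrete, here is a small numeric sketch with NumPy (the two example vectors are arbitrary, chosen only for illustration):

import numpy as np

# Two arbitrary vectors in the vector space R^3.
u = np.array([1.0, 2.0, 0.0])
v = np.array([2.0, 1.0, 1.0])

dot = np.dot(u, v)                        # scalar (dot) product: 4.0
dist = np.linalg.norm(u - v)              # Euclidean distance: sqrt(3) ~ 1.732
norm_u = np.linalg.norm(u)                # norm of u; u / norm_u is a unit vector
cos = dot / (norm_u * np.linalg.norm(v))  # cosine similarity: 4/sqrt(30) ~ 0.730

print(dot, dist, norm_u, cos)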

Later on I will go deeper into each of these areas.

Ref: Indexing by Latent Semantic Analysis

May 1, 2006

This is the pioneering work in the field of LSA. It describes for the first time what this technique consists of:

A new method for automatic indexing and retrieval is described. The approach is to take advantage of implicit higher-order structure in the association of terms with documents ("semantic structure") in order to improve the detection of relevant documents on the basis of terms found in queries. The particular technique used is singular-value decomposition, in which a large term by document matrix is decomposed into a set of ca 100 orthogonal factors from which the original matrix can be approximated by linear combination. Documents are represented by ca 100 item vectors of factor weights. Queries are represented as pseudo-document vectors formed from weighted combinations of terms, and documents with supra-threshold cosine values are returned. Initial tests find this completely automatic method for retrieval to be promising.

This paper presents an example that would become a classic and be discussed in several later papers, presumably including mine.

A concrete example may make the procedure and its putative advantages clearer. Table 2 gives a sample dataset. In this case, the document set consisted of the titles of 9 Bellcore technical memoranda. Words occurring in more than one title were selected for indexing; they are italicized. Note that there are two classes of titles: five about human-computer interaction (labeled c1-c5) and four about graph theory (labeled m1-m4). The entries in the term by document matrix are simply the frequencies with which each term actually occurred in each document. Such a matrix could be used directly for keyword-based retrievals or, as here, for the initial input of the SVD analysis.
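
A minimal sketch of the procedure the abstract describes, using NumPy's SVD on a toy term-by-document matrix; the five-term vocabulary, the matrix, and the choice of k = 2 factors are my own illustration, not the paper's nine-title dataset:

import numpy as np

# Toy term-by-document frequency matrix (rows = terms, columns = documents).
A = np.array([
    [1, 1, 0, 0],   # "human"
    [1, 0, 1, 0],   # "interface"
    [0, 1, 1, 0],   # "computer"
    [0, 0, 0, 1],   # "graph"
    [0, 0, 1, 1],   # "trees"
], dtype=float)

# Singular-value decomposition: A = U @ diag(s) @ Vt.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the k largest factors (the paper keeps about 100; here k = 2).
k = 2
Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]

# Documents live in the reduced factor space as columns of diag(sk) @ Vtk.
docs = (np.diag(sk) @ Vtk).T            # shape: (n_docs, k)

# A query is folded in as a pseudo-document: q_hat = q @ Uk @ inv(diag(sk)).
q = np.array([1, 0, 1, 0, 0], dtype=float)    # query: "human computer"
q_hat = q @ Uk @ np.linalg.inv(np.diag(sk))

# Rank documents by cosine similarity to the query in factor space.
def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

for i, d in enumerate(docs):
    print(f"doc {i}: cosine = {cosine(q_hat, d):.3f}")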

Paper: Indexing by Latent Semantic Analysis (PDF)
Authors: Scott C. Deerwester, Susan T. Dumais, Thomas K. Landauer, George W. Furnas, Richard A. Harshman
Published in: Journal of the American Society for Information Science, 1990
BibTeX entry:

@article{deerwester90indexing,
  author  = "Scott C. Deerwester and Susan T. Dumais and Thomas K. Landauer and George W. Furnas and Richard A. Harshman",
  title   = "Indexing by Latent Semantic Analysis",
  journal = "Journal of the American Society for Information Science",
  volume  = "41",
  number  = "6",
  pages   = "391--407",
  year    = "1990"
}