Latent Semantic Analysis

It’s the technique used in Mac OS X’s Mail program to catch spam, it’s patented :-( and it’s described in an awesome paper, An Introduction to Latent Semantic Analysis (PDF), by Landauer, Foltz and Laham. Wow, check this out:<blockquote>

To objectively measure how well, compared to people, LSA captures synonymy, LSA’s knowledge of synonyms was assessed with a standard vocabulary test. The 80 item test was taken from retired versions of the Educational Testing Service (ETS) Test of English as a Foreign Language (TOEFL: for which we are indebted to Larry Frase and ETS). To make these comparisons, LSA was trained by running the SVD analysis on a large corpus of representative English. In various studies, collections of newspaper text from the Associated Press news wire and Grolier’s Academic American Encyclopedia (a work intended for students), and a representative collection of children’s reading have been used. In one experiment, an SVD was performed on text segments consisting of 500 characters or less (on average 73 words, about a paragraph’s worth) taken from beginning portions of each of 30,473 articles in the encyclopedia, a total of 4.5 million words of text, roughly equivalent to what a child would have read by the end of eighth grade. This resulted in a vector for each of 60 thousand words.

The TOEFL vocabulary test consists of items in which the question part is usually a single word, and there are four alternative answers, usually single words, from which the test taker is supposed to choose the one most similar in meaning. To simulate human performance, the cosine between the question word and each alternative was calculated, and the LSA model chose the alternative closest to the stem. For six test items for which the model had never met either the stem word and/or the correct alternative, it guessed with probability .25. Scored this way, LSA got 65% correct, identical to the average score of a large sample of students applying for college entrance in the United States from non-English speaking countries.</blockquote>

and then they go on to say this:<blockquote>

When LSA chose wrongly and most students chose correctly, it sometimes appeared to be because LSA is more sensitive to contextual or paradigmatic associations and less to contrastive semantic or syntagmatic features. For example, LSA slightly preferred “nurse” (cos = .47) to “doctor” (cos = .41) as an associate for “physician.”</blockquote>

because, of course, where you find a physician, you often also find a nurse with her or him.
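To make the procedure in the quote concrete, here is a rough sketch of it in Python with NumPy: count words per passage, run an SVD, then pick the alternative whose vector has the highest cosine with the stem word. This is a toy under stated assumptions, not the authors' code. The four-passage corpus, the function names, the choice of `k=2` dimensions and the plain `log(1 + count)` weighting are all made up for illustration (the paper describes its own weighting step and uses a far larger corpus).

```python
import numpy as np

def build_term_passage_matrix(passages):
    """Count how often each word occurs in each passage (rows = words)."""
    vocab = sorted({w for p in passages for w in p.split()})
    index = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(passages)))
    for j, passage in enumerate(passages):
        for w in passage.split():
            counts[index[w], j] += 1
    return counts, index

def lsa_word_vectors(counts, k):
    """Reduce the weighted count matrix to k dimensions with a truncated SVD."""
    # Assumption: log(1 + count) is a simplified stand-in for the
    # weighting used in the paper.
    weighted = np.log1p(counts)
    u, s, _ = np.linalg.svd(weighted, full_matrices=False)
    return u[:, :k] * s[:k]  # one k-dimensional vector per word

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def choose_synonym(stem, alternatives, vectors, index):
    """Return the known alternative closest to the stem, or None.

    None means the model has never seen the stem (or any alternative);
    in the paper's scoring such items were treated as a random guess.
    """
    if stem not in index:
        return None
    known = [a for a in alternatives if a in index]
    if not known:
        return None
    return max(known, key=lambda a: cosine(vectors[index[stem]], vectors[index[a]]))

# Hypothetical toy corpus: "physician" and "doctor" never share a passage,
# but they occur in the same contexts.
passages = [
    "physician hospital patient",
    "doctor hospital patient",
    "banana fruit yellow",
    "apple fruit yellow",
]
counts, index = build_term_passage_matrix(passages)
vectors = lsa_word_vectors(counts, k=2)
print(choose_synonym("physician", ["doctor", "banana", "apple", "yellow"], vectors, index))
```

The interesting bit is that "physician" and "doctor" never appear together in the toy corpus, yet their reduced vectors end up pointing the same way because they keep the same company. That is also why a word like "nurse" can score so well in the real thing.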

So is this the start of something significant? Here we have this fairly simple statistical method (alas, patented :-( ) that can do some pretty sophisticated “learning”.

What I’d love to see is for someone to apply this technique to non-human signals. Like, say, dolphin signals. Could LSA analyse a stream of symbols representing the signals generated by dolphins? (I call them signals rather than sounds because they far exceed the limits of human hearing.) Then what? You would have an LSA model that would, in theory, embody some knowledge of the semantic relationships in dolphin language. A minimal test would be to build a system that responds to dolphin signals in real time with related signals, and then study the reactions.