Thursday, January 11, 2007

Lemmatisation

I learned a new term today: "Lemmatisation".

I was in a video conference talking about searching Video On Demand content from a set top box. The company providing the server side indexing system mentioned "Lemmatisation and Stemming" in passing without any explanation. So I did a little research and found that the term comes from computational linguistics, the language-processing side of artificial intelligence.

Lemmatisation takes a sentence (or any sequence of words), parses it to identify the part of speech (noun, adverb, etc.) of each word, and then reduces each word to its base (canonical) form, called the lemma. For example, go, goes, going, and went would all be replaced by "go". Because the analyzer needs to know the grammar being used, different languages require different parsers.
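To make that concrete, here is a toy sketch in Python. It assumes the part-of-speech tagging has already been done, and the lookup table, tag names, and the `lemmatise` function are all made up for illustration; a real lemmatiser would rely on a full dictionary and grammar for the language.

```python
# Toy lookup table: (word, part of speech) -> lemma.
# The entries and tag names are illustrative assumptions, not a real lexicon.
LEMMA_TABLE = {
    ("go", "VERB"): "go",
    ("goes", "VERB"): "go",
    ("going", "VERB"): "go",
    ("went", "VERB"): "go",
    ("saw", "VERB"): "see",   # "I saw the film"
    ("saw", "NOUN"): "saw",   # "a rusty saw"
}


def lemmatise(tagged_words):
    """Map each (word, part_of_speech) pair to its lemma.

    Words missing from the table are returned lower-cased and unchanged.
    """
    return [
        LEMMA_TABLE.get((word.lower(), pos), word.lower())
        for word, pos in tagged_words
    ]


print(lemmatise([("Went", "VERB"), ("going", "VERB"), ("goes", "VERB")]))
print(lemmatise([("saw", "VERB"), ("saw", "NOUN")]))
```

The "saw" entries show why the part of speech matters: the same spelling maps to a different base form depending on how the word is used.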

Stemming just strips common prefixes and suffixes to get to the root of the word. It is a much simpler, cruder cousin of lemmatisation because it doesn't analyze any context, nor does it care about the grammar of the language.
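A naive suffix stripper gives a feel for how little a stemmer needs to know. This is only a sketch: the suffix list, the minimum stem length, and the `stem` function are assumptions for illustration, and production stemmers (the Porter stemmer, for example) use far more careful rules.

```python
# Suffixes to strip, checked in order so "es" is tried before "s".
# This short list is an illustrative assumption.
SUFFIXES = ("ing", "es", "ed", "s")


def stem(word, min_stem_length=3):
    """Strip the first matching suffix, keeping at least min_stem_length letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_length:
            return word[: -len(suffix)]
    return word


for w in ("searching", "searches", "searched", "went"):
    print(w, "->", stem(w))
```

Note that "went" passes through untouched. Without a dictionary or any grammar, a stemmer has no way to connect an irregular form back to "go".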

In the Video On Demand example, both the searchable text and the search string are lemmatised. This shrinks the potential dictionary of search terms and makes fuzzy connections between what is explicitly being requested and what is being returned.
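Here is roughly how I picture that working, as a small Python sketch. The titles, the `NORMAL_FORMS` table, and the function names are all invented for illustration; I have no idea how the vendor's indexing system is actually built. The key point is that the same normalisation runs on the catalogue text and on the query.

```python
from collections import defaultdict

# Stand-in for a real lemmatiser: a tiny hand-written table (an assumption).
NORMAL_FORMS = {"goes": "go", "going": "go", "went": "go"}


def normalise(text):
    """Lower-case, split on whitespace, and map each word to its base form."""
    return [NORMAL_FORMS.get(word, word) for word in text.lower().split()]


def build_index(titles):
    """Build an inverted index: normalised term -> set of titles containing it."""
    index = defaultdict(set)
    for title in titles:
        for term in normalise(title):
            index[term].add(title)
    return index


def search(index, query):
    """Return the titles that contain every normalised term of the query."""
    terms = normalise(query)
    if not terms:
        return set()
    matches = [index.get(term, set()) for term in terms]
    return set.intersection(*matches)


# Hypothetical catalogue titles.
catalogue = ["Going Home", "She Went Home", "The Man Who Went North"]
index = build_index(catalogue)
print(search(index, "go home"))
print(search(index, "goes north"))
```

A search for "go home" finds both "Going Home" and "She Went Home", even though neither title contains the word "go".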

Pretty powerful stuff.

1 comment:

Anonymous said...

Lemmatisation (and stemming) looks interesting as a concept of "searching" databases.

Did you come across any software that does basic lemmatisation? ie. input full text... output lemmed text?

If you are interested in a discussion on the topic: lemonadesoda at hot mail.com