Stemming and Lemmatization

What are Stemming and Lemmatization?

Stemming and lemmatization are techniques for reducing words to a base or root form. By mapping inflected variants such as "runs" and "running" to a single token, they shrink the vocabulary, which simplifies text analysis and reduces the dimensionality of the data. Both are commonly used in NLP tasks such as text classification, information retrieval, and sentiment analysis, where they can improve the efficiency and, in some cases, the accuracy of algorithms.

Stemming

Stemming strips affixes from words to obtain their stems. In practice, most English stemmers remove only suffixes. The process follows a fixed set of rules or heuristics without consulting a dictionary, so the result is not always a valid word in the language. The Porter stemming algorithm is a popular rule-based stemmer for English.

Examples:

Original word: "running"
Stemmed word: "run"
Original word: "happiness"
Stemmed word: "happi" (Note that this is not a valid word in English, but it is still useful for text analysis because the Porter stemmer reduces both "happy" and "happiness" to this same stem)
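
To make the idea of rule-based suffix stripping concrete, here is a minimal sketch in Python. It is a toy illustration only, not the Porter algorithm: the function name toy_stem, the rule list, and the minimum stem length are all assumptions chosen for this example.

```python
def toy_stem(word: str) -> str:
    """Strip a few common English suffixes (toy illustration, not Porter)."""
    word = word.lower()
    # Ordered (suffix, replacement) rules; the first matching suffix wins.
    rules = [
        ("sses", "ss"),
        ("ness", ""),
        ("ies", "i"),
        ("ss", "ss"),
        ("ing", ""),
        ("ed", ""),
        ("s", ""),
    ]
    for suffix, replacement in rules:
        if word.endswith(suffix):
            stem = word[: len(word) - len(suffix)]
            # Apply the rule only if at least two characters remain,
            # so short words such as "sing" are left unchanged.
            if len(stem) >= 2:
                word = stem + replacement
            break
    # Undouble a trailing consonant pair: "runn" -> "run".
    if len(word) >= 2 and word[-1] == word[-2] and word[-1] not in "aeiouylsz":
        word = word[:-1]
    # Turn a final "y" into "i" when the rest of the word contains a vowel,
    # so that "happy" and "happiness" end up with the same stem.
    if word.endswith("y") and any(ch in "aeiou" for ch in word[:-1]):
        word = word[:-1] + "i"
    return word


if __name__ == "__main__":
    for w in ["running", "happiness", "happy", "cats", "sing"]:
        print(w, "->", toy_stem(w))
```

A handful of ordered rules already conflates several related forms, but it also shows why stems are not always real words. Production code would normally rely on a tested implementation, such as the PorterStemmer class in NLTK, rather than hand-written rules like these.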

Lemmatization

Lemmatization, on the other hand, goes one step further: it considers the context of a word and converts it to its base form according to the language's morphological rules. It typically involves looking up the word in a morphological dictionary or running a morphological analysis algorithm to obtain its lemma, or canonical form.

Examples:

Original word: "running" (as a verb)
Lemma: "run"
Original word: "better" (as an adjective)
Lemma: "good"

In the case of lemmatization, it is essential to know the part-of-speech (POS) of the word in context, as the same word form might have different lemmas depending on its grammatical role. For example, the word "running" can be a verb (e.g., "She is running") or a noun (e.g., "Running is good for you"). In the first case, the lemma would be "run," while in the second case, it would be "running."
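
The dependence on part of speech can be sketched with a small dictionary lookup. This is a simplified illustration under stated assumptions: the LEXICON table, the tag names ("VERB", "NOUN", "ADJ"), and the function toy_lemmatize are hypothetical and stand in for a real morphological dictionary.

```python
# Hypothetical miniature lexicon mapping (word form, POS tag) to a lemma.
LEXICON = {
    ("running", "VERB"): "run",
    ("running", "NOUN"): "running",
    ("better", "ADJ"): "good",
    ("ran", "VERB"): "run",
}


def toy_lemmatize(word: str, pos: str) -> str:
    """Look up the lemma for (word, pos); fall back to the lowercased word."""
    key = (word.lower(), pos)
    return LEXICON.get(key, word.lower())


if __name__ == "__main__":
    print(toy_lemmatize("running", "VERB"))
    print(toy_lemmatize("running", "NOUN"))
    print(toy_lemmatize("better", "ADJ"))
```

The same surface form "running" yields different lemmas depending on the tag passed in, which is why lemmatizers are usually run after a POS tagger. Library lemmatizers follow a similar pattern; for example, NLTK's WordNetLemmatizer accepts a pos argument in its lemmatize method.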
