What are stemming and lemmatization in NLP?

Stemming and lemmatization are two methods used in Natural Language Processing (NLP) to standardize text and prepare words or documents for further machine learning processing.

In pre-processing, our task is to reduce words to their root form, a technique known as canonicalization. Canonicalization is a method for transforming data that has many possible representations into a standard, or normal, form. Though stemming and lemmatization are two of the most popular canonicalization techniques, they have certain limitations. In this article, we’ll try to understand these limitations and see how phonetic hashing comes to the rescue.

What is the difference between stemming and lemmatization?

Stemming is a canonicalization technique that attempts to reduce a word to its root form by dropping its affixes. Lemmatization is another canonicalization technique that maps a word to its lemma, its dictionary form; as a consequence, for lemmatization to give correct results, the words in the corpus must be spelled correctly.
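
To make the distinction concrete, here is a minimal sketch using NLTK’s PorterStemmer and WordNetLemmatizer (this assumes the nltk package is installed and the WordNet data has been downloaded; the example words are purely illustrative):

# Compare stems and lemmas for a few words using NLTK.
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)  # the lemmatizer needs the WordNet corpus

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["studies", "corpora", "running"]:
    # The lemmatizer treats each word as a noun unless a part of speech is given.
    print(f"{word:10} stem: {stemmer.stem(word):10} lemma: {lemmatizer.lemmatize(word)}")

The stemmer simply chops off suffixes, producing stems such as ‘studi’, while the lemmatizer looks the word up in a dictionary and returns a valid lemma such as ‘study’ or ‘corpus’.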

Limitations of Stemming and Lemmatization

A significant issue occurs in both methods when dealing with words that have multiple spelling variants, whether due to regional conventions or to spellings that follow pronunciation. For example, the words ‘Colour’ and ‘Color’ might be treated differently by a stemmer, although they both mean the same thing. Likewise, ‘travelling’ and ‘traveling’ would give rise to two different stems/lemmas despite being variations of the same word, as the sketch below shows.
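
The behaviour is easy to reproduce with a stemmer; the short sketch below (again assuming NLTK) shows the spelling variants producing different stems:

# Spelling variants of the same word can produce different stems.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

print(stemmer.stem("colour"), "|", stemmer.stem("color"))          # e.g. colour | color
print(stemmer.stem("travelling"), "|", stemmer.stem("traveling"))  # e.g. travell | travel

Because the stems differ, the two spellings of each word end up being treated as unrelated tokens downstream.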

Implement Phonetic Search in Python with Soundex Algorithm

In this article, we will cover word similarity matching using the Soundex algorithm in Python.

What is Soundex Algorithm and how does it work?

Soundex is a phonetic algorithm that can locate words and phrases with similar sounds. A Soundex search takes a word as input, such as a person’s name, and outputs a character string that identifies a group of words that are (roughly) phonetically similar, i.e. that sound (approximately) the same.
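
As a quick illustration (assuming the third-party jellyfish library, installable with pip install jellyfish; the example names are our own), similar-sounding names receive the same Soundex code:

# Similar-sounding names map to the same Soundex code.
import jellyfish

for name in ["Robert", "Rupert", "Rubin"]:
    print(name, "->", jellyfish.soundex(name))

‘Robert’ and ‘Rupert’ both hash to R163, while ‘Rubin’ gets a different code (R150), so a phonetic search for ‘Robert’ would also surface ‘Rupert’.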

Phonetic Hashing

Phonetic hashing is a technique used to canonicalize words that have the same phonetic characteristics. With phonetic hashing, each word is assigned a hash code based on its phonemes, the smallest units of sound, so words that are phonetic variants of each other tend to receive the same code. Phonetic hashing is achieved using Soundex algorithms, which are available for different languages. In the next section, we will explore the American Soundex algorithm.
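
A small sketch of this idea (again assuming the jellyfish library) shows phonetic variants receiving identical hash codes:

# Phonetic variants of a word receive the same Soundex hash code.
import jellyfish

for a, b in [("colour", "color"), ("travelling", "traveling")]:
    print(a, jellyfish.soundex(a), "|", b, jellyfish.soundex(b))

Both spellings in each pair map to a single code (C460 and T614 respectively), so the very pairs that stemming and lemmatization treated as different words can now be canonicalized to one representation.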

American Soundex Algorithm

The American Soundex algorithm performs phonetic hashing on English-language words, i.e. it takes a word in English and generates its hash code.
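
The following is a minimal, self-contained sketch of the American Soundex algorithm in plain Python; the function name american_soundex and the test words are illustrative rather than taken from any particular library:

def american_soundex(word: str) -> str:
    """Return the 4-character American Soundex code for an English word."""
    # Consonant-to-digit mapping used by American Soundex.
    codes = {
        **dict.fromkeys("bfpv", "1"),
        **dict.fromkeys("cgjkqsxz", "2"),
        **dict.fromkeys("dt", "3"),
        "l": "4",
        **dict.fromkeys("mn", "5"),
        "r": "6",
    }

    word = "".join(ch for ch in word.lower() if ch.isalpha())
    if not word:
        return ""

    first_letter = word[0].upper()      # rule 1: keep the first letter
    prev_code = codes.get(word[0], "")  # its digit, so adjacent repeats are dropped
    digits = []

    for ch in word[1:]:
        if ch in "hw":                  # 'h' and 'w' do not separate repeated codes
            continue
        code = codes.get(ch, "")        # vowels map to "" and reset prev_code
        if code and code != prev_code:
            digits.append(code)
        prev_code = code

    # Pad with zeros (or truncate) to the first letter plus three digits.
    return (first_letter + "".join(digits))[:4].ljust(4, "0")


if __name__ == "__main__":
    for w in ["Robert", "Rupert", "Ashcraft", "Tymczak", "colour", "color"]:
        print(f"{w:10} -> {american_soundex(w)}")

Running it prints R163 for both ‘Robert’ and ‘Rupert’, A261 for ‘Ashcraft’, T522 for ‘Tymczak’, and C460 for both ‘colour’ and ‘color’, matching the behaviour described above.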