Stemming and Lemmatization
When working with Natural Language, we are not much interested in the form of words – rather, we are concerned with the meaning that the words intend to convey. Thus, we try to map every word of the language to its root/base form. This process is called canonicalization.
E.g. The words ‘play’, ‘plays’, ‘played’, and ‘playing’ convey the same action – hence, we can map them all to their base form i.e. ‘play’.
Now, there are two widely used canonicalization techniques: Stemming and Lemmatization.
Introduction to NLTK: Tokenization, Stemming, Lemmatization, POS Tagging
Natural Language Toolkit (NLTK) is one of the largest Python libraries for performing various Natural Language Processing tasks. From rudimentary tasks such as text pre-processing to tasks like vectorized representation of text – NLTK’s API has covered everything. In this article, we will accustom ourselves to the basics of NLTK and perform some crucial NLP tasks: Tokenization, Stemming, Lemmatization, and POS Tagging.
Table of Content
- What is the Natural Language Toolkit (NLTK)?
- Tokenization
- Stemming and Lemmatization
- Stemming
- Lemmatization
- Part of Speech Tagging