Lemmatization

Like stemming, lemmatization reduces a word to a base form. The difference is that lemmatization returns a lemma that is a valid word in the language, whereas a stem may not be. In NLTK, we use the WordNetLemmatizer to get the lemmas of words. The lemma of a word depends on its part of speech, so the lemmatizer accepts the part of speech as a parameter (pos); if it is omitted, NLTK treats the word as a noun.
 

Python
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
lemmatizer = WordNetLemmatizer()

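def lemma_words(text):
    # split the text into word tokens
    word_tokens = word_tokenize(text)
    # pos='v' lemmatizes every token as a verb: 'uses' becomes 'use',
    # while plural nouns such as 'methods' are left unchanged
    lemmas = [lemmatizer.lemmatize(word, pos='v') for word in word_tokens]
    return lemmas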

input_str = "data science uses scientific methods algorithms and many types of processes"
print(lemma_words(input_str))

Example: 

Input: 'data science uses scientific methods algorithms and many types of processes'
Output: ['data', 'science', 'use', 'scientific', 'methods', 'algorithms', 'and', 'many', 'type', 'of', 'process']
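The example above applies one part of speech to every token. A common alternative is to tag each token first and pass its own part of speech to the lemmatizer. The following is a minimal sketch of that idea, not the method used in this article. The helper names penn_to_wordnet and lemma_words_with_pos are hypothetical, and the sketch assumes the NLTK data for the tokenizer, the tagger and WordNet has already been downloaded.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag (as produced by nltk.pos_tag) to a WordNet POS letter."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # treat everything else as a noun, the lemmatizer's default


def lemma_words_with_pos(text):
    """Lemmatize each token using its own tagged part of speech (hypothetical helper)."""
    # NLTK is imported here so the mapping helper above works without it
    from nltk import pos_tag
    from nltk.stem import WordNetLemmatizer
    from nltk.tokenize import word_tokenize

    lemmatizer = WordNetLemmatizer()
    tagged = pos_tag(word_tokenize(text))
    return [lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag)) for word, tag in tagged]
```

Calling lemma_words_with_pos(input_str) returns a list of lemmas just like lemma_words, but the result depends on how the tagger labels each token, so it can differ from the output shown above.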

Text Preprocessing in Python

Text processing is the analysis and manipulation of text data using a programming language such as Python. It is an essential task in NLP because it cleans and transforms raw text into a format suitable for analysis or modeling.

In this article, we will look at the Python libraries and techniques involved in text processing.

Prerequisites: Introduction to NLP

Whenever we have textual data, we need to apply several preprocessing steps to transform the words into numerical features that machine learning algorithms can work with. Which steps are needed depends mainly on the domain and the problem itself, so we don’t need to apply every step to every problem.
This article walks through text preprocessing in Python using the NLTK (Natural Language Toolkit) library.

Python
# import the necessary libraries
import nltk
import string
import re
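To give a feel for what a preprocessing step looks like, here is a minimal sketch that uses only the string and re modules imported above. The function name basic_clean and the sample sentence are assumptions made for illustration; which steps to apply, and in what order, depends on the problem.

```python
import re
import string


def basic_clean(text):
    """Lowercase the text, drop digits and punctuation, and collapse whitespace."""
    text = text.lower()
    # remove runs of digits
    text = re.sub(r"\d+", "", text)
    # remove ASCII punctuation characters
    text = text.translate(str.maketrans("", "", string.punctuation))
    # collapse the extra spaces left behind by the removals
    return " ".join(text.split())


print(basic_clean("There are 3 Balls in this Bag, and 12 in the Other one!"))
# prints: there are balls in this bag and in the other one
```

Note that string.punctuation covers ASCII punctuation only, so typographic quotes or dashes in real text would need separate handling.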
