Tokenization
Tokenization refers to breaking a sentence into smaller units called tokens. It is one of the essential steps in text pre-processing. For this, iNLTK offers a function tokenize(text, language_code), which takes the input text and its language code as arguments.
Example:
We tokenize the sentence ‘गीक्स फॉर गीक्स एक बेहतरीन टेक्नोलॉजी लर्निंग प्लेटफॉर्म है।’ (the Hindi translation of ‘GeeksforGeeks is a great technology learning platform.’)
Python3
from inltk.inltk import tokenize

# Note: the Hindi model must be downloaded once beforehand
# by calling setup('hi') from inltk.inltk.
text = 'गीक्स फॉर गीक्स एक बेहतरीन टेक्नोलॉजी लर्निंग प्लेटफॉर्म है।'
tokenize(text, 'hi')
Output:
['▁गी', 'क्स', '▁फॉर', '▁गी', 'क्स', '▁एक', '▁बेहतरीन', '▁टेक्नोलॉजी', '▁ल', 'र्न', 'िंग', '▁प्लेटफॉर्म', '▁है', '।']
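The '▁' (U+2581) character in the output is a SentencePiece-style marker that indicates a token begins a new word; tokens without it continue the previous word. As a minimal illustration (using plain Python, not the iNLTK API), the original sentence can be reassembled from these subword tokens by joining them and turning each marker back into a space:

```python
# Subword tokens produced by iNLTK for the example sentence.
tokens = ['▁गी', 'क्स', '▁फॉर', '▁गी', 'क्स', '▁एक', '▁बेहतरीन',
          '▁टेक्नोलॉजी', '▁ल', 'र्न', 'िंग', '▁प्लेटफॉर्म', '▁है', '।']

# '▁' (U+2581) marks a word boundary: join everything, then
# replace each marker with a space and strip the leading one.
sentence = ''.join(tokens).replace('\u2581', ' ').strip()
print(sentence)
# गीक्स फॉर गीक्स एक बेहतरीन टेक्नोलॉजी लर्निंग प्लेटफॉर्म है।
```

This round trip shows that subword tokenization is lossless: no information about the original sentence is discarded.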
Hence, we have tokenized a sentence using iNLTK.
iNLTK: Natural Language Toolkit for Indic Languages in Python
We are all aware of the popular NLP library NLTK (Natural Language Toolkit), which is used to perform diverse NLP tasks and operations. NLTK, however, is geared primarily towards the English language. In this article, we will explore iNLTK, the Natural Language Toolkit for Indic Languages. As the name suggests, iNLTK is a Python library used to perform NLP operations in Indian languages.