Tokenization
Tokenization refers to breaking a sentence into smaller units called tokens. It is one of the essential steps in text pre-processing. For this, iNLTK offers a function tokenize(text, language_code), which takes the input text and its language code as arguments.
Example:
We tokenize the sentence ‘गीक्स फॉर गीक्स एक बेहतरीन टेक्नोलॉजी लर्निंग प्लेटफॉर्म है।’ (the Hindi translation of ‘GeeksforGeeks is a great technology learning platform.’)
Python3
from inltk.inltk import tokenize

# Note: the Hindi model must be downloaded once beforehand
# by calling setup('hi') from inltk.inltk.
text = 'गीक्स फॉर गीक्स एक बेहतरीन टेक्नोलॉजी लर्निंग प्लेटफॉर्म है।'
tokenize(text, 'hi')
Output:
['▁गी', 'क्स', '▁फॉर', '▁गी', 'क्स', '▁एक', '▁बेहतरीन', '▁टेक्नोलॉजी', '▁ल', 'र्न', 'िंग', '▁प्लेटफॉर्म', '▁है', '।']
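The '▁' (U+2581) character in the output is a SentencePiece-style marker that indicates a token begins a new word; tokens without it continue the previous word. As a minimal illustration (using plain Python, not the iNLTK API), the original sentence can be reassembled from these subword tokens by joining them and turning each marker back into a space:

```python
# Subword tokens produced by iNLTK for the example sentence.
tokens = ['▁गी', 'क्स', '▁फॉर', '▁गी', 'क्स', '▁एक', '▁बेहतरीन',
          '▁टेक्नोलॉजी', '▁ल', 'र्न', 'िंग', '▁प्लेटफॉर्म', '▁है', '।']

# '▁' (U+2581) marks a word boundary: join everything, then
# replace each marker with a space and strip the leading one.
sentence = ''.join(tokens).replace('\u2581', ' ').strip()
print(sentence)
# गीक्स फॉर गीक्स एक बेहतरीन टेक्नोलॉजी लर्निंग प्लेटफॉर्म है।
```

This round trip shows that subword tokenization is lossless: no information about the original sentence is discarded.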
Hence, we have tokenized a sentence using iNLTK.
iNLTK: Natural Language Toolkit for Indic Languages in Python
We are all aware of the popular NLP library NLTK (Natural Language Toolkit), which is used to perform diverse NLP tasks and operations. NLTK, however, is geared primarily towards the English language. In this article, we will explore iNLTK, the Natural Language Toolkit for Indic Languages. As the name suggests, iNLTK is a Python library used to perform NLP operations in Indian languages.