Implementation for Tokenization
Sentence Tokenization using sent_tokenize
The code snippet uses the sent_tokenize function from the NLTK library. The sent_tokenize
function segments a given text into a list of sentences.
Python3
from nltk.tokenize import sent_tokenize

text = "Hello everyone. Welcome to w3wiki. You are studying NLP article."
sent_tokenize(text)
Output:
['Hello everyone.',
'Welcome to w3wiki.',
'You are studying NLP article.']
How sent_tokenize works?
The sent_tokenize function uses an instance of PunktSentenceTokenizer from the nltk.tokenize.punkt module. This instance has already been trained, so it knows which characters and punctuation mark the beginning and end of a sentence.
Sentence Tokenization using PunktSentenceTokenizer
When we have huge chunks of data, it is efficient to use PunktSentenceTokenizer
directly from the NLTK library. The Punkt tokenizer is a data-driven sentence tokenizer that ships with NLTK; it is trained on a large corpus of text to identify sentence boundaries.
Python3
import nltk.data

# Loading PunktSentenceTokenizer using the English pickle file
tokenizer = nltk.data.load('tokenizers/punkt/PY3/english.pickle')

text = "Hello everyone. Welcome to w3wiki. You are studying NLP article."
tokenizer.tokenize(text)
Output:
['Hello everyone.',
'Welcome to w3wiki.',
'You are studying NLP article.']
Tokenize sentences of a different language
One can also tokenize sentences in languages other than English by loading a pickle file other than the English one. In the following code snippet, we use the NLTK library to tokenize Spanish text into sentences using the pre-trained Punkt tokenizer for Spanish. The Punkt tokenizer is a data-driven tokenizer that uses machine learning techniques to identify sentence boundaries.
Python3
import nltk.data

# Loading PunktSentenceTokenizer using the Spanish pickle file
spanish_tokenizer = nltk.data.load('tokenizers/punkt/PY3/spanish.pickle')

text = 'Hola amigo. Estoy bien.'
spanish_tokenizer.tokenize(text)
Output:
['Hola amigo.',
'Estoy bien.']
Word Tokenization using word_tokenize
The code snippet uses the word_tokenize function from the NLTK library to tokenize a given text into individual words. The word_tokenize function is helpful for breaking down a sentence or text into its constituent words, facilitating further analysis or processing at the word level in natural language processing tasks.
Python3
from nltk.tokenize import word_tokenize

text = "Hello everyone. Welcome to w3wiki."
word_tokenize(text)
Output:
['Hello', 'everyone', '.', 'Welcome', 'to', 'w3wiki', '.']
How word_tokenize works?
The word_tokenize() function is a wrapper that calls tokenize() on an instance of the TreebankWordTokenizer class.
Word Tokenization Using TreebankWordTokenizer
The code snippet uses the TreebankWordTokenizer
from the Natural Language Toolkit (NLTK) to tokenize a given text into individual words.
Python3
from nltk.tokenize import TreebankWordTokenizer

text = "Hello everyone. Welcome to w3wiki."
tokenizer = TreebankWordTokenizer()
tokenizer.tokenize(text)
Output:
['Hello', 'everyone.', 'Welcome', 'to', 'w3wiki', '.']
These tokenizers work by separating the words using punctuation and spaces. As the outputs above show, they do not discard the punctuation, allowing the user to decide what to do with it during pre-processing.
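Since the punctuation tokens are kept, a common pre-processing step is to filter them out afterwards. Below is a minimal sketch in plain Python, starting from the token list word_tokenize produced above. Using str.isalnum() is one simple choice of filter; note it would also drop tokens containing apostrophes or hyphens.

```python
# Token list as produced by word_tokenize in the example above
tokens = ['Hello', 'everyone', '.', 'Welcome', 'to', 'w3wiki', '.']

# Keep only alphanumeric tokens, dropping standalone punctuation marks
words_only = [tok for tok in tokens if tok.isalnum()]
print(words_only)  # ['Hello', 'everyone', 'Welcome', 'to', 'w3wiki']
```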
Word Tokenization using WordPunctTokenizer
The WordPunctTokenizer is one of the NLTK tokenizers that splits words based on punctuation boundaries. Each punctuation mark is treated as a separate token.
Python3
from nltk.tokenize import WordPunctTokenizer

tokenizer = WordPunctTokenizer()
tokenizer.tokenize("Let's see how it's working.")
Output:
['Let', "'", 's', 'see', 'how', 'it', "'", 's', 'working', '.']
Word Tokenization using Regular Expression
The code snippet uses the RegexpTokenizer
from the Natural Language Toolkit (NLTK) to tokenize a given text based on a regular expression pattern.
Python3
from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'\w+')
text = "Let's see how it's working."
tokenizer.tokenize(text)
Output:
['Let', 's', 'see', 'how', 'it', 's', 'working']
Using regular expressions allows for more fine-grained control over tokenization, and you can customize the pattern based on your specific requirements.
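For instance, the pattern r'\w+' above splits contractions apart. A slightly richer pattern can keep them together. The sketch below uses the standard-library re module to illustrate the idea (RegexpTokenizer applies its pattern in essentially this way, and the same pattern string could be passed to it):

```python
import re

text = "Let's see how it's working."

# \w+ optionally followed by an apostrophe and more word characters,
# so contractions like "Let's" stay as one token
pattern = r"\w+(?:'\w+)?"
tokens = re.findall(pattern, text)
print(tokens)  # ["Let's", 'see', 'how', "it's", 'working']
```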
More Techniques for Tokenization
We have discussed how to perform tokenization using the NLTK library. Tokenization can also be implemented using the following methods and libraries:
- Spacy: spaCy is an NLP library that provides robust tokenization capabilities.
- BERT tokenizer: BERT uses a WordPiece tokenizer, a type of subword tokenizer, for tokenizing input text.
- Byte-Pair Encoding: Byte Pair Encoding (BPE) is a data compression algorithm that has also found applications in the field of natural language processing, specifically for tokenization. It is a subword tokenization technique that works by iteratively merging the most frequent pairs of consecutive bytes (or characters) in a given corpus.
- Sentence Piece: SentencePiece is another subword tokenization algorithm commonly used for natural language processing tasks. It is designed to be language-agnostic and works by iteratively merging frequent sequences of characters or subwords in a given corpus.
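To make the Byte-Pair Encoding idea concrete, here is a minimal, self-contained sketch of the merge loop in plain Python. The toy vocabulary and the end-of-word marker </w> are illustrative assumptions, not the API of any particular library; real BPE implementations also handle edge cases this sketch ignores.

```python
from collections import Counter

def get_pair_counts(vocab):
    # Count adjacent symbol pairs across the whole vocabulary,
    # weighted by each word's corpus frequency
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    # Replace every occurrence of the chosen pair with one merged symbol
    old = ' '.join(pair)
    new = ''.join(pair)
    return {word.replace(old, new): freq for word, freq in vocab.items()}

# Toy vocabulary: each word is a space-separated sequence of characters
# plus an end-of-word marker, mapped to its corpus frequency
vocab = {'l o w </w>': 5, 'l o w e r </w>': 2,
         'n e w e s t </w>': 6, 'w i d e s t </w>': 3}

merges = []
for _ in range(3):
    pairs = get_pair_counts(vocab)
    best = max(pairs, key=pairs.get)  # most frequent adjacent pair
    vocab = merge_pair(best, vocab)
    merges.append(best)

print(merges)  # [('e', 's'), ('es', 't'), ('est', '</w>')]
```

Each iteration merges the most frequent adjacent pair, so frequent endings like "est" gradually become single subword tokens.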
NLP | How tokenizing text, sentence, words works
Tokenization in natural language processing (NLP) is a technique that involves dividing a sentence or phrase into smaller units known as tokens. These tokens can encompass words, dates, punctuation marks, or even fragments of words. The article aims to cover the fundamentals of tokenization, its types, and use cases.