Below is the implementation:
Python
# First we need to import spacy
import spacy

# Creating a blank language object, then
# tokenizing the words of the sentence
nlp = spacy.blank("en")

doc = nlp("w3wiki is a one stop learning destination for geeks.")

for token in doc:
    print(token)
Output:
w3wiki
is
a
one
stop
learning
destination
for
geeks
.
We can also add functionality to tokens by loading additional pipeline components with spacy.load().
Python
import spacy

# Loading a pretrained pipeline adds components beyond the tokenizer
nlp = spacy.load("en_core_web_sm")
print(nlp.pipe_names)
Output:
['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']
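Loading a pretrained pipeline brings in all of these components, which can be slower than necessary if we only need a few of them. As a side note, spacy.load() accepts a disable argument that switches components off; here is a minimal sketch using the component names listed above:
Python
import spacy

# Disable components we don't need; disabled components are not run
# and do not appear in nlp.pipe_names
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
print(nlp.pipe_names)
# ['tok2vec', 'tagger', 'attribute_ruler', 'lemmatizer']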
Here is an example showing the additional functionality we gain by adding these components to the pipeline.
Python
import spacy

# Loading modules to the pipeline
nlp = spacy.load("en_core_web_sm")

# Initialising doc with a sentence
doc = nlp("If you want to be an excellent programmer, "
          "be consistent to practice daily on GFG.")

# Using properties of token, i.e. part of speech and lemmatization
for token in doc:
    print(token, " | ", spacy.explain(token.pos_), " | ", token.lemma_)
Output:
If | subordinating conjunction | if
you | pronoun | you
want | verb | want
to | particle | to
be | auxiliary | be
an | determiner | an
excellent | adjective | excellent
programmer | noun | programmer
, | punctuation | ,
be | auxiliary | be
consistent | adjective | consistent
to | particle | to
practice | verb | practice
daily | adverb | daily
on | adposition | on
GFG | proper noun | GFG
. | punctuation | .
In the above example, we used the part-of-speech (POS) and lemmatization features of the loaded pipeline, which gave us the POS tag of every token along with its lemma (the base form every token is reduced to). This functionality was not available before; it was added only after we loaded our NLP instance with "en_core_web_sm".
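The loaded pipeline also includes the 'ner' component we saw in nlp.pipe_names, so the same doc objects carry named entity annotations. Here is a minimal sketch (the sample sentence is our own illustration):
Python
import spacy

nlp = spacy.load("en_core_web_sm")

# The 'ner' component annotates spans of text as named entities
doc = nlp("Apple was founded by Steve Jobs in California.")

for ent in doc.ents:
    print(ent.text, " | ", ent.label_, " | ", spacy.explain(ent.label_))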
Tokenization Using the spaCy Library
Before moving to the explanation of tokenization, let's first discuss what spaCy is. spaCy is a library for Natural Language Processing (NLP). It is an object-oriented library used to preprocess text and sentences, and to extract information from text using its modules and functions.
Tokenization is the process of splitting a text or a sentence into segments called tokens. It is the first step of text preprocessing, and its output serves as the input for subsequent processes like text classification, lemmatization, etc.
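For instance, even a blank pipeline's tokenizer splits punctuation and contractions into separate tokens. Below is a small sketch of this behaviour (the sample text is our own illustration):
Python
import spacy

nlp = spacy.blank("en")
doc = nlp("Let's learn NLP, token by token!")

# Each token carries its position in the doc via token.i;
# "Let's" is split into "Let" and "'s", and punctuation is separated
for token in doc:
    print(token.i, token.text)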
Creating a blank language object gives us a tokenizer and an empty pipeline; we can then add the modules we need to that pipeline one by one.
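Here is a minimal sketch of that workflow, using add_pipe() to attach a component to the blank pipeline (the sentencizer is just one illustrative built-in component):
Python
import spacy

# A blank pipeline contains only the tokenizer
nlp = spacy.blank("en")
print(nlp.pipe_names)   # []

# Add a rule-based component that marks sentence boundaries
nlp.add_pipe("sentencizer")
print(nlp.pipe_names)   # ['sentencizer']

doc = nlp("w3wiki is a one stop learning destination. Practice daily.")
for sent in doc.sents:
    print(sent.text)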