Below is the implementation:
Python
# First we need to import spacy
import spacy

# Creating a blank language object, then
# tokenizing the words of the sentence
nlp = spacy.blank("en")

doc = nlp("w3wiki is a one stop learning destination for geeks.")

for token in doc:
    print(token)
Output:
w3wiki
is
a
one
stop
learning
destination
for
geeks
.
We can also add functionality to tokens by loading additional pipeline components with spacy.load().
Python
import spacy

# Loading a pretrained pipeline adds components beyond the tokenizer
nlp = spacy.load("en_core_web_sm")
print(nlp.pipe_names)
Output:
['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']
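Loading a pretrained pipeline brings in all of these components, which can be slower than necessary if we only need a few of them. As a side note, spacy.load() accepts a disable argument that switches components off; here is a minimal sketch using the component names listed above:
Python
import spacy

# Disable components we don't need; disabled components are not run
# and do not appear in nlp.pipe_names
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
print(nlp.pipe_names)
# ['tok2vec', 'tagger', 'attribute_ruler', 'lemmatizer']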
Here is an example showing the additional functionality we gain by adding these components to the pipeline.
Python
import spacy

# Loading modules to the pipeline
nlp = spacy.load("en_core_web_sm")

# Initialising doc with a sentence
doc = nlp("If you want to be an excellent programmer, "
          "be consistent to practice daily on GFG.")

# Using properties of token, i.e. part of speech and lemmatization
for token in doc:
    print(token, " | ", spacy.explain(token.pos_), " | ", token.lemma_)
Output:
If | subordinating conjunction | if
you | pronoun | you
want | verb | want
to | particle | to
be | auxiliary | be
an | determiner | an
excellent | adjective | excellent
programmer | noun | programmer
, | punctuation | ,
be | auxiliary | be
consistent | adjective | consistent
to | particle | to
practice | verb | practice
daily | adverb | daily
on | adposition | on
GFG | proper noun | GFG
. | punctuation | .
In the above example, we used the part-of-speech (POS) and lemmatization features of the loaded pipeline, which gave us the POS tag of every token along with its lemma (the base form every token is reduced to). This functionality was not available before; it was added only after we loaded our NLP instance with "en_core_web_sm".
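The loaded pipeline also includes the 'ner' component we saw in nlp.pipe_names, so the same doc objects carry named entity annotations. Here is a minimal sketch (the sample sentence is our own illustration):
Python
import spacy

nlp = spacy.load("en_core_web_sm")

# The 'ner' component annotates spans of text as named entities
doc = nlp("Apple was founded by Steve Jobs in California.")

for ent in doc.ents:
    print(ent.text, " | ", ent.label_, " | ", spacy.explain(ent.label_))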
Tokenization Using the spaCy Library
Before moving to the explanation of tokenization, let's first discuss what spaCy is. spaCy is a library for Natural Language Processing (NLP). It is an object-oriented library used to preprocess text and sentences, and to extract information from text using its modules and functions.
Tokenization is the process of splitting a text or a sentence into segments called tokens. It is the first step of text preprocessing, and its output serves as the input for subsequent processes like text classification, lemmatization, etc.
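For instance, even a blank pipeline's tokenizer splits punctuation and contractions into separate tokens. Below is a small sketch of this behaviour (the sample text is our own illustration):
Python
import spacy

nlp = spacy.blank("en")
doc = nlp("Let's learn NLP, token by token!")

# Each token carries its position in the doc via token.i;
# "Let's" is split into "Let" and "'s", and punctuation is separated
for token in doc:
    print(token.i, token.text)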
Creating a blank language object gives us a tokenizer and an empty pipeline; we can then add the modules we need to that pipeline one by one.
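Here is a minimal sketch of that workflow, using add_pipe() to attach a component to the blank pipeline (the sentencizer is just one illustrative built-in component):
Python
import spacy

# A blank pipeline contains only the tokenizer
nlp = spacy.blank("en")
print(nlp.pipe_names)   # []

# Add a rule-based component that marks sentence boundaries
nlp.add_pipe("sentencizer")
print(nlp.pipe_names)   # ['sentencizer']

doc = nlp("w3wiki is a one stop learning destination. Practice daily.")
for sent in doc.sents:
    print(sent.text)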