Spacy Tokenizer Example in Python: Implement in 4 Steps Only


Spacy is an advanced Python NLP package used for preprocessing text. The best part is that it is free and open source. There are many things you can do with Spacy, such as lemmatization, tokenization, and POS tagging on a document. In this tutorial you will learn how to implement the Spacy tokenizer through a series of steps.

But before going to the steps, make sure you have installed Spacy on your system.
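If you have not installed it yet, a minimal install looks like the following (installing inside a virtual environment is a common recommendation, but not required):

```shell
# Install spaCy from PyPI
pip install spacy
```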

Steps to implement Spacy Tokenizer

In this section you will learn all the steps for tokenizing a document using Spacy. Follow the steps in order for a better understanding.

Step 1: Import required libraries

The first and most basic step is to import all the necessary libraries. In this example I am using the Spacy package only, so let's import it using the import statement.

import spacy

Step 2: Load your language model

There are many languages you can tokenize. You can find them in the Spacy documentation. In this example I am using the English language model, so let's load it using the spacy.load() method. But make sure you have downloaded the model on your system.

To download the model use the following command in your terminal.

python -m spacy download en_core_web_sm

Now use the spacy.load() method to load the English model. Add the below line of code.

nlp = spacy.load("en_core_web_sm")
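If the packaged model is not installed, spacy.load() raises an OSError. A small defensive sketch (my own addition, not part of the original steps) falls back to spacy.blank("en"), which builds a tokenizer-only English pipeline that needs no download; tokenization still works, though components like the tagger and entity recognizer will be absent:

```python
import spacy

# Try the full English model first; fall back to a blank
# tokenizer-only pipeline if the model is not downloaded.
try:
    nlp = spacy.load("en_core_web_sm")
except OSError:
    nlp = spacy.blank("en")

# Either way, nlp can now tokenize text.
doc = nlp("Hello world.")
print([token.text for token in doc])
```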

Step 3: Create an NLP document

The third step is to create a document that will be used for implementing the Spacy tokenizer. To create a document, pass the string as an argument to the nlp object.

text = "Data Science Learner is the best site for data science I have ever seen."
doc = nlp(text)

Step 4: Implement spacy tokenizer on the document

Now let's implement the Spacy tokenizer on the created document. Here I use an empty list that will contain all the tokens. Execute the complete code and see the output.

import spacy

# Load the English model and run it on the text
nlp = spacy.load("en_core_web_sm")
text = "Data Science Learner is the best site for data science I have ever seen."
doc = nlp(text)

# Collect every token of the document into a list
tokens = []
for token in doc:
    tokens.append(token)
print(tokens)
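The loop above can also be written as a list comprehension, which is the more idiomatic Python. The sketch below uses spacy.blank("en") (a tokenizer-only pipeline, my substitution so no model download is required) and collects token.text strings instead of Token objects:

```python
import spacy

# A blank English pipeline is enough when you only need tokenization
nlp = spacy.blank("en")
doc = nlp("Data Science Learner is the best site for data science I have ever seen.")

# Idiomatic one-liner: collect the string form of each token
tokens = [token.text for token in doc]
print(tokens)
```

Note that each item in `doc` is a Token object; attributes such as `token.text`, `token.idx`, and `token.is_punct` are available even without a trained model.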

Output

[Figure: Spacy tokenizer of a document]

You can see the text has been tokenized, and even the full stop is treated as its own token.

Spacy also lets you label tokens with named entities. Use doc.ents to get the entities recognized in the document. Suppose I have the headline of a news article; I can label the entities in it using Spacy. Execute the below lines of code.

import spacy

nlp = spacy.load("en_core_web_sm")
text = "Remarks by President Biden on Rebuilding Our Manufacturing to Make More in America"
doc = nlp(text)

# Print each entity with its label and a human-readable explanation
for ent in doc.ents:
    print(f"{ent.text} - {ent.label_} - {spacy.explain(ent.label_)}")
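The ent.label_ values are short codes; spacy.explain() resolves any of them from Spacy's built-in glossary, and it works even without loading a model or document. A quick sketch (the label codes below are standard spaCy entity types, shown here for illustration):

```python
import spacy

# spacy.explain() looks labels up in spaCy's glossary,
# so no model needs to be loaded for this to work.
for label in ["PERSON", "GPE", "ORG"]:
    print(label, "-", spacy.explain(label))
```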

Output

[Figure: Entity labeling of the document]

You can see how the words Biden and America have been labelled.

End Notes

Spacy is one of the best NLP packages for preprocessing text. These are the steps for implementing the Spacy tokenizer in Python. I hope you have liked this tutorial. If you have any query then you can contact us for more help.


Meet Sukesh ( Chief Editor ), a passionate and skilled Python programmer with a deep fascination for data science, NumPy, and Pandas. His journey in the world of coding began as a curious explorer and has evolved into a seasoned data enthusiast.
 