A regular expression lets you find a pattern in text, but it works on raw characters and knows nothing about words or their linguistic properties. Rule-based matching in spaCy instead matches on tokens, phrases, and entities, using patterns that you define over their attributes. To do this, you use a spaCy matcher.
spaCy offers three rule-based matching tools:
- Token Matcher
- Phrase Matcher
- Entity Ruler
This tutorial shows how to perform rule-based matching using the Token Matcher.
What is the Token Matcher?
spaCy's rule-based matching engine is the Matcher class. It operates on the tokens extracted from a text and returns the token sequences that fit the patterns you add to it. It also lets you pass in a custom callback that runs on each match.
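The callback mentioned above can be sketched as follows. This is a minimal illustration, not the only way to use it: the found list, the collect_email function, and the example.com address are made up for the sketch, and a blank English pipeline is assumed to be enough because LIKE_EMAIL is a lexical attribute that needs no trained model.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: a blank pipeline (tokenizer only) suffices for LIKE_EMAIL.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

found = []  # hypothetical list that collects the matched text


def collect_email(matcher, doc, i, matches):
    """Runs once per match; i is the index of the current match."""
    match_id, start, end = matches[i]
    found.append(doc[start:end].text)


# on_match registers the callback for every pattern added under "EMAIL".
matcher.add("EMAIL", [[{"LIKE_EMAIL": True}]], on_match=collect_email)

# The address below is a placeholder, not a real contact.
doc = nlp("Write to info@example.com for details.")
matcher(doc)  # calling the matcher triggers the callback
print(found)
```

A callback is handy when you want to act on a match as soon as it is found, for example to log it or to set an attribute on the matched tokens.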
Steps to implement Token Matcher
This section shows how to extract information by matching text against defined patterns. Before starting, make sure spaCy is installed on your system, and follow the steps in order.
Step 1: Import the required package
The first step is to import the required packages. Only the spacy package and its Matcher class are needed. Use the lines below to import them.
import spacy
from spacy.matcher import Matcher
Step 2: Load the Language model
spaCy provides models for many languages. This example uses the small English model, en_core_web_sm, loaded with the spacy.load() method. The model must be downloaded first. If it is not on your system yet, run python -m spacy download en_core_web_sm in your terminal.
Then load the model with the following line.
nlp = spacy.load("en_core_web_sm")
Step 3: Create the spaCy Matcher
The third step is to create the matcher by passing the pipeline's vocabulary, nlp.vocab, to the Matcher() constructor. The matcher must share its vocabulary with the documents it will run on.
matcher = Matcher(nlp.vocab)
Step 4: Define the Pattern
Let’s create a pattern that the matcher will use to scan the document. A pattern is a list of dictionaries, one per token, and each dictionary describes the attributes that token must have. For example, to find an email address, define the pattern as below.
pattern = [{"LIKE_EMAIL": True}]
The spaCy documentation lists the other token attributes you can use in patterns.
After that, add the pattern to the Matcher so it can be used to search the text. Use the line below.
matcher.add("EMAIL",[pattern])
You can give the pattern any name you want. Here the name is "EMAIL". The second argument is a list of patterns, which is why the single pattern is wrapped in a list.
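Patterns are not limited to one token. As a rough sketch of a longer pattern, the example below matches the word "email", optionally followed by "address" and a punctuation mark, and then an email address. The rule name EMAIL_WITH_LABEL, the sample sentence, and the example.com addresses are made up for illustration, and a blank English pipeline is assumed because only lexical attributes are used.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: no trained model is needed for LOWER, IS_PUNCT and LIKE_EMAIL.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# One dictionary per token; "OP": "?" makes a token optional.
pattern = [
    {"LOWER": "email"},                # "email", "Email", "EMAIL", ...
    {"LOWER": "address", "OP": "?"},   # optional word "address"
    {"IS_PUNCT": True, "OP": "?"},     # optional punctuation such as ":"
    {"LIKE_EMAIL": True},              # the address itself
]
matcher.add("EMAIL_WITH_LABEL", [pattern])  # hypothetical rule name

# Placeholder addresses, not real contacts.
doc = nlp("Email: info@example.com or use the email address sales@example.com")

# Sort by start position so the results follow the order of the text.
results = [doc[start:end].text
           for _, start, end in sorted(matcher(doc), key=lambda m: m[1])]
print(results)
```

Matching on LOWER rather than on the exact text keeps the rule independent of capitalization, which is hard to express cleanly when you only compare fixed strings.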
Step 5: Apply the pattern
After defining the pattern, apply it to a document. For simplicity, this example uses a short sample text, but you can use your own. Passing the text to nlp() creates the document.
text = "You can contact Data Science Learner through email address [email protected]"
doc = nlp(text)
After that, pass the document to the matcher.
matches = matcher(doc)
Step 6: Display the matched Text
The last step is to read the matched text from the document, which in this case is an email address. The matcher returns a list of (match_id, start, end) tuples, where start and end are token indices, and there can be more than one match in a document. Therefore, loop over the matches and slice the document. Add the following lines of code.
for match_id, start, end in matches:
    print(doc[start:end])
Here is the complete code. When you run it, you will get all the email addresses in the document.
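The match_id in each tuple is a hash of the rule name rather than the name itself. As a small sketch of how to read it back, the example below looks the name up in nlp.vocab.strings, and also uses the as_spans=True option (available in spaCy v3) to get Span objects directly. The sample sentence and the example.com address are placeholders, and a blank English pipeline is assumed.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: a blank pipeline is enough because LIKE_EMAIL is lexical.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
matcher.add("EMAIL", [[{"LIKE_EMAIL": True}]])

doc = nlp("Reach us at info@example.com today.")  # placeholder address

# Option 1: tuples of (match_id, start, end) with token indices.
for match_id, start, end in matcher(doc):
    rule_name = nlp.vocab.strings[match_id]  # hash back to "EMAIL"
    print(rule_name, start, end, doc[start:end].text)

# Option 2: ask for Span objects; the rule name becomes the span label.
spans = matcher(doc, as_spans=True)
labels = [(span.label_, span.text) for span in spans]
print(labels)
```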
Complete Code
import spacy
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
pattern =[{"LIKE_EMAIL":True}]
matcher.add("EMAIL",[pattern])
text = "You can contact Data Science Learner through email address [email protected]"
doc = nlp(text)
matches = matcher(doc)
for match_id, start, end in matches:
    print(doc[start:end])
You can also define more than one pattern and search for all of them in the same document. For example, to find the proper nouns (names) in the text as well, use the pattern [{"POS": "PROPN"}]. This pattern relies on the part-of-speech tags assigned by the loaded language model.
Run the complete code given below. It prints the proper nouns along with the email address found in the document.
import spacy
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
pattern = [[{"LIKE_EMAIL": True}], [{"POS": "PROPN"}]]
matcher.add("My_Pattern",pattern)
text = "You can contact Data Science Learner through email address [email protected]"
doc = nlp(text)
matches = matcher(doc)
for match_id, start, end in matches:
    print(doc[start:end])
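When several patterns share one name, as in the code above, the matches cannot be told apart by rule. One possible variation is to register each pattern under its own name and group the results. In the sketch below, the names CAPITALIZED and EMAIL, the by_rule dictionary, and the example.com address are made up for illustration. It also assumes a blank pipeline, so {"IS_TITLE": True} is used only as a rough stand-in for {"POS": "PROPN"}; with a trained model you would keep the POS pattern.

```python
from collections import defaultdict

import spacy
from spacy.matcher import Matcher

# Assumption: blank pipeline, so no POS tags are available here.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# One name per pattern, so each match can be traced back to its rule.
matcher.add("EMAIL", [[{"LIKE_EMAIL": True}]])
matcher.add("CAPITALIZED", [[{"IS_TITLE": True}]])  # stand-in for PROPN

# Placeholder address, not a real contact.
doc = nlp("contact Data Science Learner at info@example.com")

by_rule = defaultdict(list)  # rule name -> list of matched text
for match_id, start, end in matcher(doc):
    by_rule[nlp.vocab.strings[match_id]].append(doc[start:end].text)

for rule_name, texts in by_rule.items():
    print(rule_name, texts)
```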
Conclusion
The spaCy Matcher is useful for finding text in a document with rule-based matching, and it has many applications. For example, you can use it to extract email addresses, names, and postal addresses from the text of an invoice in PDF format. These are the steps for implementing a spaCy matcher. I hope you liked this tutorial. If you have any questions, you can contact us for more help.