Advanced Text Processing using NLTK: The Complete Guide


Basic text preprocessing steps such as removing punctuation, removing stopwords, and tokenization help you turn a corpus into meaningful text. But these are only the basics. There are also advanced text processing techniques that help you build meaningful features for your NLP project. In this tutorial, you will learn about these techniques. You will learn the following things:

The building of N-Grams

Parts of Speech Tagging (POS)

TF-IDF (Term Frequency-Inverse Document Frequency) Text Mining

The building of N-Grams

N-grams are sequences of N consecutive items in a given sample of text. The name depends upon the value of N: it is a bigram if N is 2, a trigram if N is 3, a four-gram if N is 4, and so on. For example, consider the text "You are a good person". The following are the N-grams for it.

Bi-gram
(You, are), (are, a), (a, good), (good, person)

Tri-gram
(You, are, a), (are, a, good), (a, good, person)
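
As a quick check, here is a minimal sketch that generates these pairs and triples with the ngrams helper from nltk.util (the same helper used later in this post):

from nltk.util import ngrams

#n-grams of the example sentence
sentence = "You are a good person".split()
print(list(ngrams(sentence, 2)))   #bigrams
print(list(ngrams(sentence, 3)))   #trigrams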

I will continue with the same code that was written in the previous post. For reference, the full code from the previous tutorial is:

import os
import nltk

#download the required NLTK resources (only needed once)
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

#read the file
file = open(os.getcwd()+ "/sample.txt","rt")
raw_text = file.read()
file.close()

#tokenization
token_list = nltk.word_tokenize(raw_text)

#Remove Punctuation
from nltk.tokenize import punkt
token_list2 = list(filter(lambda token : punkt.PunktToken(token).is_non_punct,token_list))


#upper to lower case
token_list3 = [word.lower() for word in token_list2]


#remove stopwords
from nltk.corpus import stopwords
token_list4 = list(filter(lambda token: token not in stopwords.words("english"),token_list3))

#lemmatization
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
token_list5 = [lemmatizer.lemmatize(word) for word in token_list4]
print(token_list5[0:20],"\n")
print("Total tokens : ", len(token_list5))

For N-grams you have to import the ngrams function from nltk.util. Here I will print the bigrams and trigrams for the given sample text.

Bigrams

from nltk.util import ngrams
#find the bigrams
bigrams = ngrams(token_list5,2)
print(list(bigrams))

bigrams of the tokens

Trigrams

# Trigrams 
trigrams = ngrams(token_list5,3)
print(list(trigrams))

trigrams of the tokens

Parts of Speech Tagging (POS)

POS tagging is generally used to identify the part of speech of each word in a corpus. That is, it identifies whether a word is a verb, noun, adjective, etc. The NLTK package labels each POS tag with a short abbreviation such as NN (noun), JJ (adjective), or VBP (verb, singular present). There are various popular use cases of POS tagging: it is used in entity recognition, filtering, and sentiment analysis, and a more advanced use case is building a chatbot. To use NLTK for POS tagging, you first have to download the averaged perceptron tagger using nltk.download("averaged_perceptron_tagger"). Then you apply the nltk.pos_tag() method to the tokens generated earlier, in this example the token_list5 variable.

nltk.download("averaged_perceptron_tagger")
# POS Tagging the first 10 words
nltk.pos_tag(token_list5)[:10]

pos tagging for the words
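
If you want to know what a tag abbreviation stands for, NLTK also ships a description of the Penn Treebank tagset. A small optional sketch, assuming the "tagsets" resource is downloaded:

import nltk

nltk.download("tagsets")
#look up the meaning of a tag abbreviation, e.g. NN or JJ
nltk.help.upenn_tagset("NN")
nltk.help.upenn_tagset("JJ")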

TF-IDF (Term Frequency-Inverse Document Frequency) Text Mining

A machine learning model takes only numeric input, but text consists of words, letters, and other symbols. To apply machine learning to text, you can use the TF-IDF method to convert the text into a numeric table representation. A TF-IDF table has one row for each document in the corpus, and the columns represent the words. Each cell holds a value that measures the strength of that word in that particular document.

The higher the strength of a word, the stronger the association between that word and the document.

How does TF-IDF work? Step by Step

Step 1: Read the corpus

First of all, read the corpus.

Step 2: Clean the corpus

After reading the corpus, your next step is to clean it, for example by removing punctuation, stopwords, etc.

Step 3: Create a Count Table

In this step, you create a table where the rows represent the documents and the columns represent the words. Each cell holds the count of the word in the document, that is, how many times the word appears in that document.

Step 4: Create a Term Frequency Table

After creating the count table, the next step is to build the term frequency table. To do this, divide each cell value by the total number of words in the document. For example, if a document has three words and a cell value of 1, dividing by 3 gives 0.33.

Step 5: Find the Inverse Document Frequency

The formula for IDF is log(total number of documents / number of documents containing the word). The main goal of IDF is to highlight unique words, the words that give relevant meaning to a document. The fewer documents a word appears in, the higher its IDF.

Step 6: Multiply TF*IDF

In the last step, multiply the IDF value of each word by the TF value of that word in each cell.
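
To make Steps 3 to 6 concrete, here is a minimal hand-rolled sketch on a toy two-document corpus. It uses raw counts divided by document length for TF and log(total docs / docs containing the word) for IDF; note that the sklearn TfidfVectorizer used below applies a slightly different, smoothed formula, so its numbers will not match this sketch exactly.

import math
from collections import Counter

#toy corpus, already cleaned (Step 2)
docs = [["machine", "learning", "future"],
        ["future", "automation", "jobs"]]

#Step 3: count table
counts = [Counter(doc) for doc in docs]
vocab = sorted(set(word for doc in docs for word in doc))

#Step 4: term frequency = count / total words in the document
tf = [{word: count[word] / len(doc) for word in vocab} for count, doc in zip(counts, docs)]

#Step 5: inverse document frequency = log(total docs / docs containing the word)
n_docs = len(docs)
idf = {word: math.log(n_docs / sum(1 for count in counts if word in count)) for word in vocab}

#Step 6: TF-IDF = TF * IDF
tf_idf = [{word: tf_doc[word] * idf[word] for word in vocab} for tf_doc in tf]
for row in tf_idf:
    print(row)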

The above steps are just for learning purposes. In practice, we will use sklearn's TfidfVectorizer to fully automate them.

Let us define a corpus with the following text.

corpus = [
    " Machine learning is the future",
    " Future will be full of automation",
    " Automation will kill the jobs"
]

Now import TfidfVectorizer from sklearn.feature_extraction.text. Create a TfidfVectorizer() instance and use its fit_transform() method on the corpus.

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    " Machine learning is the future",
    " Future will be full of automation",
    " Automation will kill the jobs"
]
vectorizer = TfidfVectorizer(stop_words="english")
tf_idf = vectorizer.fit_transform(corpus)
print("Token's used as Features ")
print(vectorizer.get_feature_names(),"\n")
print("Size of the array")
print(tf_idf.shape,"\n")
print("TF-IDF Matrix\n")
print(tf_idf.toarray()) 

vectorizer.get_feature_names_out() returns all the words that make up the TF-IDF matrix. To print the TF-IDF matrix, you first have to convert it to an array and then print it. The following is the output.

tf-idf matrix for the corpus
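
As an optional follow-up, you can label the rows and columns of this matrix by loading it into a pandas DataFrame. A minimal sketch, assuming the vectorizer and tf_idf objects from the code above:

import pandas as pd

#rows are documents, columns are the feature words
df = pd.DataFrame(tf_idf.toarray(), columns=vectorizer.get_feature_names_out())
print(df.round(2))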

Conclusion

Advanced text processing is an essential task for every NLP programmer. Building N-grams, POS tagging, and TF-IDF have many use cases, and which of them you apply depends on your project. Use N-grams to predict the next word, POS tagging for sentiment analysis or entity labeling, and TF-IDF to measure the uniqueness of a document.

I hope this tutorial has answered all your queries about advanced text processing. If you have any questions, you can contact or message us on our Data Science Learner Official Page.

 

