How to Preprocess Text Data in Python

Text preprocessing is an essential skill for any NLP or data science programmer. Just as you preprocess data before building a machine learning model, you need to preprocess text before you can extract meaningful information from it. In this tutorial, you will learn how to preprocess text data in Python using the NLTK module.

I have already explained what NLTK is and what its use cases are. If you want to read more, see the post on Reading and Analyze the Corpus using NLTK.

You will learn the following things here.
Tokenization of the text
Cleaning of the text
Removal of the stop words
Lemmatization of the words

Before doing any cleaning on the text, you should first import the necessary libraries and read the raw text file. I am reading from the local directory. However, you can also read the corpus from a specific URL or server.

import os
import nltk

# Read the whole file into one string; the with block closes the file automatically
with open(os.path.join(os.getcwd(), "sample.txt"), "rt") as file:
    raw_text = file.read()
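As a small variation on the snippet above, the sketch below wraps the read in a helper with an explicit encoding. The function name read_corpus and the UTF-8 default are assumptions made for illustration, so adjust them to match your own file.

```python
from pathlib import Path


def read_corpus(path, encoding="utf-8"):
    """Return the full contents of a text file as one string."""
    # The with block closes the file even if reading raises an error
    with open(Path(path), "rt", encoding=encoding) as handle:
        return handle.read()
```

You could then call raw_text = read_corpus("sample.txt") in place of the manual open and close calls.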

How to do tokenization of text?

Tokenization breaks a corpus down into words, symbols, sentences, paragraphs, and other meaningful elements. It must be done before any further cleaning of the text. Without it, you cannot properly remove things like punctuation and stop words.

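For tokenization, nltk has a function word_tokenize(). It breaks down the raw text and returns the tokens as a list. If NLTK reports a missing tokenizer resource, download it with nltk.download() using the resource name given in the error message (for example "punkt"). You can also track the effect of each later cleaning step by comparing the size of the token list.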

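# Tokenize the raw text into a list of word and punctuation tokens
token_list = nltk.word_tokenize(raw_text)
# Preview the first 20 tokens and report the total count
print(token_list[0:20],"\n")
print("Total tokens : ", len(token_list))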

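To make the idea of a token concrete without depending on downloaded NLTK data, here is a minimal sketch that splits text with a regular expression. It is a simplified stand-in and not the algorithm that word_tokenize() uses. The function name simple_tokenize is hypothetical.

```python
import re


def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches one character that is neither a word character nor whitespace
    return re.findall(r"\w+|[^\w\s]", text)


sample_tokens = simple_tokenize("Hello, world! NLTK is fun.")
print(sample_tokens)
# ['Hello', ',', 'world', '!', 'NLTK', 'is', 'fun', '.']
print("Total tokens : ", len(sample_tokens))
```

Notice that punctuation marks become tokens of their own, which is why the later cleaning steps reduce the size of the token list.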

How to remove punctuation in a text using nltk?

After tokenization, the next step is to remove punctuation and convert uppercase words to lower case. For removing punctuation you will use PunktToken from nltk.tokenize.punkt. Here I am using a lambda function and filtering the list by checking the is_non_punct property of each token, which is set for tokens that are numbers or contain letters. For lower case conversion you will apply Python's built-in lower() method to each token in the filtered list.

from nltk.tokenize import punkt
# Keep only the tokens that are not pure punctuation
token_list2 = list(filter(lambda token : punkt.PunktToken(token).is_non_punct,token_list))
print(token_list2[0:20],"\n")
print("Total tokens : ", len(token_list2))

#Conversion of all words to lower case
token_list3 = [word.lower() for word in token_list2]
print(token_list3[0:20],"\n")
print("Total tokens : ", len(token_list3))
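The two cleaning steps can also be sketched in plain Python. This version uses str.isalnum() as an assumed stand-in for the PunktToken check, so it may behave differently on edge cases. The function name clean_tokens is hypothetical.

```python
def clean_tokens(tokens):
    """Drop tokens with no letters or digits, then lower-case the rest."""
    # A token is kept if at least one of its characters is a letter or digit
    kept = [tok for tok in tokens if any(ch.isalnum() for ch in tok)]
    return [tok.lower() for tok in kept]


tokens = ["Hello", ",", "World", "!", "It", "'s", "2024", "..."]
cleaned = clean_tokens(tokens)
print(cleaned)
# ['hello', 'world', 'it', "'s", '2024']
print("Total tokens : ", len(cleaned))
```

Removing punctuation before lower-casing keeps the order of the steps the same as in the NLTK code, although for these two operations the final result does not depend on the order.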

How to remove Stopwords?

Stop words contribute little to text analysis because they carry very little meaning on their own. Examples of stop words are in, the, and, which, etc. It is usually better to remove them. NLTK already ships a list of stop words that you can compare your tokens against. You can download it using nltk.download("stopwords"). After that, import stopwords from the nltk corpus. In this case, I am using a lambda function to keep only the tokens that are not in the stop words, and assigning them to a new token list variable.

from nltk.corpus import stopwords
# Build the stop word set once so that each membership test is fast
stop_words = set(stopwords.words("english"))
#remove stopwords
token_list4 = list(filter(lambda token: token not in stop_words,token_list3))
print(token_list4[0:20],"\n")
print("Total tokens : ", len(token_list4))

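The filtering logic itself does not depend on NLTK. Here is a minimal sketch that assumes a tiny hand-written stop word set in place of the NLTK list. The names SAMPLE_STOP_WORDS and remove_stop_words are hypothetical.

```python
# A tiny hand-written stop word set, used here only for illustration
SAMPLE_STOP_WORDS = {"in", "the", "and", "which", "is", "a"}


def remove_stop_words(tokens, stop_words=SAMPLE_STOP_WORDS):
    """Return the tokens that do not appear in stop_words."""
    return [tok for tok in tokens if tok not in stop_words]


tokens = ["the", "model", "is", "trained", "in", "python"]
filtered = remove_stop_words(tokens)
print(filtered)
# ['model', 'trained', 'python']
print("Total tokens : ", len(filtered))
```

Keeping the stop words in a set rather than a list makes each lookup fast, which matters when the token list is large. The tokens should already be lower case at this point, because the comparison is case sensitive.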

Lemmatization of the words

Lemmatization is an important step in text preprocessing. It reduces each word to its root form, called the lemma. For example, developed and developing both have the root word “develop”, which is their lemmatized version. Lemmatization uses a dictionary to match each word with its root word.

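For lemmatization, you first have to download WordNet using nltk.download("wordnet"). After that, you will use WordNetLemmatizer() to lemmatize the token list. Note that lemmatize() treats every word as a noun unless you pass a part of speech, so verb forms such as developed are only reduced to develop when you call lemmatize(word, pos="v").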

from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
token_list5 = [lemmatizer.lemmatize(word) for word in token_list4]
print(token_list5[0:20],"\n")
print("Total tokens : ", len(token_list5))

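To illustrate the dictionary lookup idea behind lemmatization, here is a toy sketch. The lookup table TOY_LEMMAS and the function toy_lemmatize are hypothetical and written only for this example. A real lemmatizer such as WordNetLemmatizer consults a full lexicon and also takes the part of speech into account.

```python
# Hypothetical lookup table mapping inflected forms to their root words
TOY_LEMMAS = {
    "developed": "develop",
    "developing": "develop",
    "mice": "mouse",
    "studies": "study",
}


def toy_lemmatize(word, lemmas=TOY_LEMMAS):
    """Return the root word if it is in the table, otherwise the word itself."""
    # Words missing from the table are returned unchanged
    return lemmas.get(word, word)


tokens = ["developed", "developing", "mice", "python"]
lemmatized = [toy_lemmatize(tok) for tok in tokens]
print(lemmatized)
# ['develop', 'develop', 'mouse', 'python']
```

Unknown words pass through unchanged, which is also how a dictionary-based lemmatizer behaves when it cannot find a match.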

End Notes

Just as you preprocess data in machine learning, you also have to preprocess text. These are some of the basic steps for preprocessing text data in Python. In this tutorial, I described only the basic steps. In the next post, you will learn more advanced ways to do text preprocessing, so keep visiting our site for the next tutorial. If you want to learn more, contact us or message us at our official Facebook Page.

 
