An Introduction to NLTK: Read and Analyze the Corpus Using NLTK

The internet is full of text data: emails, blogs, messages, and comments on social networks, to name a few. Because there is so much of it, analyzing this text and retrieving useful information from it is a valuable and interesting task for every data professional. This introduction covers the NLTK Python package for reading, exploring, and analyzing text. You will learn the basic terminology first, and then read, explore, and analyze a sample text file step by step.

An Introduction to NLTK (Terminology):

Here are a few terms you will meet when working with NLTK:

Document

A document is a collection of sentences that represents a specific fact, also known as an entity. It consists of paragraphs, sentences, and words. Examples of documents are a software log file, a product review, or the tweets of a specific user in a particular context. In database terms, a document is a record in the data.

Corpus

You already know the term document. In text mining, a collection of similar documents is known as a corpus. Documents inside a corpus are typically related to some specific entity or time period, for example the tweets of a user account in a month, or the daily log files or product reviews of a particular month. You can think of a corpus as a table in a database.
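To make the two terms concrete, here is a minimal plain-Python sketch (no NLTK yet). The file names, the review texts, and the toy_corpus dictionary are made-up illustrations, not part of any real dataset.

```python
# A "document" is one piece of text; a "corpus" is a collection of related documents.
# The file names and review texts below are hypothetical examples.
toy_corpus = {
    "review_001.txt": "The battery lasts all day. I am very happy with it.",
    "review_002.txt": "The screen cracked after a week. Not happy.",
    "review_003.txt": "Good value for the price.",
}

# Each key identifies one document, much like a key identifies a record in a table.
for doc_id, text in toy_corpus.items():
    print(doc_id, "->", len(text.split()), "whitespace-separated words")

total_documents = len(toy_corpus)
print("Documents in corpus:", total_documents)
```

NLTK's corpus readers follow the same idea: they map file names to text and add tokenization on top.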

Introduction to NLTK: Programming Examples

NLTK (Natural Language Toolkit) is a platform that helps you write Python code that works with human language data. It has various libraries and packages for NLP (Natural Language Processing), and it provides more than 50 corpora and lexical resources for processing and analyzing text, with support for tasks such as classification, tokenization, stemming, and tagging. Some of these resources are the Punkt Tokenizer Models, the Web Text Corpus, WordNet, and SentiWordNet. You can look at all of these corpora on the official NLTK Corpora Data page.

Steps to Read and Analyze the Sample Text

Step 1: Import the necessary libraries

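In this step, I will use the Python standard os module and the NLTK library. You can install NLTK with pip install nltk; on systems where pip still points to Python 2, use pip3 install nltk to install it for Python 3.x. Current NLTK releases support Python 3 only. The nltk.download call below fetches the Punkt tokenizer models that NLTK uses to split text into sentences.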

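import os
import nltk

# Download the Punkt sentence tokenizer models (used by corpus.sents() and corpus.paras()).
# Recent NLTK releases may ask for "punkt_tab" instead; download that if you get a LookupError.
nltk.download("punkt")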

Step 2: Define the filename and path

I am coding in a Jupyter notebook, so I could access the file path directly. However, the best practice is to let the os module determine the current working directory. This is also very helpful if you work on projects in PyCharm.

PATH = os.getcwd()
FILE_NAME = "sample.txt"
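Before handing the path to NLTK, it can help to check that the file is really there. The following is a small sketch using only the standard library; the full_path variable and the messages are my own additions for illustration.

```python
import os

PATH = os.getcwd()          # directory the script or notebook is running in
FILE_NAME = "sample.txt"    # same file name as used in this tutorial

# Join directory and file name in an OS-independent way.
full_path = os.path.join(PATH, FILE_NAME)

# Report a missing file early, with a readable message.
if os.path.isfile(full_path):
    print("Found:", full_path)
else:
    print("Missing:", full_path, "- create it or change PATH/FILE_NAME")
```

Using os.path.join avoids hard-coding "/" or "\" separators, so the same code works on Linux, macOS, and Windows.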

Step 3: Read the text

You will use NLTK's PlaintextCorpusReader and pass it the directory path and the file name of the sample text. I am assigning the result to a separate variable, corpus.

from nltk.corpus.reader.plaintext import PlaintextCorpusReader
corpus = PlaintextCorpusReader(PATH, FILE_NAME)
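If you do not have a sample.txt at hand, the sketch below creates one in a temporary directory and reads it back. The sample text and the names SAMPLE_TEXT, demo_corpus, file_ids, and raw_text are hypothetical. It assumes NLTK is installed; as far as I know, fileids() and raw() do not need any downloaded tokenizer data.

```python
import os
import tempfile

from nltk.corpus.reader.plaintext import PlaintextCorpusReader

# Hypothetical sample text; replace it with your own file in practice.
SAMPLE_TEXT = "NLTK makes text analysis easier.\n\nThis is the second paragraph."

with tempfile.TemporaryDirectory() as tmp_dir:
    # Write the text to sample.txt inside a temporary directory.
    with open(os.path.join(tmp_dir, "sample.txt"), "w", encoding="utf8") as handle:
        handle.write(SAMPLE_TEXT)

    # First argument: the corpus root directory.
    # Second argument: a file name, a list of names, or a regular expression.
    demo_corpus = PlaintextCorpusReader(tmp_dir, "sample.txt")

    file_ids = demo_corpus.fileids()   # names of the matched files
    raw_text = demo_corpus.raw()       # full, untokenized text

print(file_ids)
print(raw_text)
```

The results are stored in variables inside the with block because the temporary directory is deleted as soon as the block ends.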

Step 4: Explore the corpus

In this step, you can explore the corpus text. For example, to print the raw corpus data, use:

print(corpus.raw())

File IDs

The fileids() method returns the names of all the files in the corpus as a list.

print(corpus.fileids())
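File IDs become more useful once the corpus holds several files. Here is a sketch with two made-up review files; the file names, texts, and the reviews and word_counts variables are hypothetical. It relies on the second argument of PlaintextCorpusReader being interpreted as a regular expression when it is a string, and it only uses words(), which should not need downloaded data.

```python
import os
import tempfile

import nltk
from nltk.corpus.reader.plaintext import PlaintextCorpusReader

# Hypothetical documents; in practice these would be files you already have.
documents = {
    "review_1.txt": "Great project. Great team.",
    "review_2.txt": "The project was late.",
}

word_counts = {}
with tempfile.TemporaryDirectory() as tmp_dir:
    for name, text in documents.items():
        with open(os.path.join(tmp_dir, name), "w", encoding="utf8") as handle:
            handle.write(text)

    # A regular expression as second argument picks up every matching file.
    reviews = PlaintextCorpusReader(tmp_dir, r"review_\d+\.txt")

    for file_id in reviews.fileids():
        # Passing a file id restricts words() to that single document.
        freq = nltk.FreqDist(word.lower() for word in reviews.words(file_id))
        word_counts[file_id] = freq["project"]

print(word_counts)
```

Each file ID can be passed to raw(), words(), sents(), or paras() to work on one document at a time instead of the whole corpus.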

Extracting Paragraphs and the Number of Paragraphs

# extract paragraphs from the corpus
paragraphs = corpus.paras()
print("Total Paragraphs:", len(paragraphs))

The corpus.paras() method finds all the paragraphs in a corpus. To get the total number of paragraphs, pass the result to the len() function.
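By default, the plaintext reader treats blank lines as paragraph boundaries. The plain-Python sketch below imitates only that splitting step on a made-up string; it is not NLTK's implementation, and the real paras() additionally tokenizes every paragraph into sentences and words.

```python
import re

# Hypothetical raw text with three paragraphs separated by blank lines.
raw_text = (
    "First paragraph, line one.\n"
    "First paragraph, line two.\n"
    "\n"
    "Second paragraph.\n"
    "\n"
    "\n"
    "Third paragraph."
)

# Split wherever one or more blank (or whitespace-only) lines occur.
paragraphs = [p for p in re.split(r"\n\s*\n", raw_text) if p.strip()]

print("Total Paragraphs:", len(paragraphs))
for number, text in enumerate(paragraphs, start=1):
    print(number, repr(text))
```

If your file uses single line breaks between paragraphs, the paragraph count will look too low, because the whole text is then seen as one block.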

Extracting Sentences

# extract sentences from the corpus
sentences = corpus.sents()
print("Total Sentences:", len(sentences))
print("First sentence:", sentences[0])

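To find the total number of sentences in a corpus, first call the sents() method and then pass the result to the len() function. Each sentence is returned as a list of word tokens, so sentences[0] prints a list of strings rather than one string.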
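Once you have the sentences, simple statistics are only a line or two away. The sketch below uses a hand-written list in the same shape as the output of sents() (a list of sentences, each a list of tokens), so it runs without any corpus file; the example sentences are made up.

```python
# Hand-written stand-in with the same shape as the output of corpus.sents():
# a list of sentences, each sentence a list of string tokens.
sentences = [
    ["NLTK", "is", "a", "Python", "library", "."],
    ["It", "reads", "text", "."],
]

total_sentences = len(sentences)
total_tokens = sum(len(sentence) for sentence in sentences)
average_length = total_tokens / total_sentences

print("Total Sentences:", total_sentences)
print("Average tokens per sentence:", average_length)
```

With a real corpus you would replace the hand-written list with corpus.sents().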

Extracting Words

# extract words from the corpus
print("Words in the corpus:", corpus.words())

The corpus.words() method returns all the words in the corpus. With the default tokenizer, punctuation marks are returned as separate tokens too.
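Because punctuation and capitalization end up in the word list, you may want to normalize the tokens before counting them. The following sketch uses WordPunctTokenizer, which to my knowledge is the default word tokenizer of PlaintextCorpusReader and needs no downloaded data. The example sentence and the clean_words name are made up.

```python
from nltk.tokenize import WordPunctTokenizer

# Hypothetical text; note the punctuation and the mixed capitalization.
text = "The project is done. The PROJECT, finally, is done!"
tokens = WordPunctTokenizer().tokenize(text)
print(tokens)

# Keep alphabetic tokens only and lowercase them so "PROJECT" and "project" match.
clean_words = [token.lower() for token in tokens if token.isalpha()]
print(clean_words)
```

The same list comprehension works on corpus.words() directly.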

Step 5: Analyze the corpus

freq = nltk.FreqDist(corpus.words())
# common words
print("Common Words:", freq.most_common(10))
# specific word
print("Specific Word:", freq.get("project"))

NLTK has many functions for frequency distribution analysis; you can find more of them in the NLTK documentation. In this example, I am finding the 10 most common words (freq.most_common(10)) and the number of occurrences of a specific word (freq.get("project")). If the word does not occur in the corpus, freq.get() returns None.
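Here are a few more FreqDist methods that I find handy, shown on a small hand-written token list so the numbers are easy to check. The word list is a made-up stand-in for corpus.words().

```python
import nltk

# Hypothetical token list standing in for corpus.words().
words = ["the", "project", "is", "done", "the", "project", "works", "the", "end"]
freq = nltk.FreqDist(words)

print("Total tokens:", freq.N())             # number of tokens counted
print("Distinct words:", freq.B())           # number of different words
print("Top 2:", freq.most_common(2))         # (word, count) pairs, most frequent first
print("Share of 'the':", freq.freq("the"))   # relative frequency between 0 and 1

# Indexing returns 0 for unseen words, while .get() returns None.
print(freq["missing"], freq.get("missing"))
```

Indexing with square brackets is often more convenient than get() because you can use the result in arithmetic without checking for None first.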

All the above steps cover reading and analyzing a single document in a corpus. You can use any number of similar documents in the same way. The next tutorial covers cleaning and extracting text: removing stop words, tokenization, stemming, and lemmatization. Stay tuned.