On the internet you will come across a large amount of text data, such as emails, blogs, messages, and comments on social networks. For a data professional, analyzing this text and retrieving useful information from it is a valuable and interesting task. This tutorial is an introduction to the NLTK Python package for reading, exploring, and analyzing text. It covers the basic terminology first and then walks through reading and analyzing a sample text.
An Introduction to NLTK (Terminology)
Here are a few terms you will meet when working with NLTK:
A document is a collection of sentences that represents a specific fact, also known as an entity. It consists of paragraphs, sentences, and words. Examples of documents are a software log file, a product review, or the tweets of a specific user in a particular context. In database terms, a document is a record in the data.
You already know the term document. In text mining, a collection of similar documents is known as a corpus. The documents in a corpus are usually related to a specific entity or time period, for example the tweets of a user account in one month, a set of daily log files, or the product reviews from a particular month. You can think of a corpus as a table in a database.
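To make the two terms concrete, here is a minimal sketch in plain Python that treats a corpus as a table with one record per document. The review texts and file names are made up for illustration and are not part of NLTK.

```python
# Hypothetical product reviews: each string is one document.
corpus = {
    "review_001.txt": "The battery lasts all day. Very happy with it.",
    "review_002.txt": "Screen cracked after a week. Not happy.",
}

# The corpus behaves like a table: one row (record) per document.
for doc_id, text in corpus.items():
    print(doc_id, "->", len(text.split()), "whitespace-separated words")

# A corpus-level statistic is an aggregate over all documents.
total_words = sum(len(text.split()) for text in corpus.values())
print("Documents:", len(corpus), "Total words:", total_words)
```

NLTK's corpus readers follow the same idea, with files on disk playing the role of the documents.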
Introduction to NLTK: Programming Examples
NLTK (Natural Language Toolkit) is a platform that helps you write Python code that works with human language data. It provides various libraries and packages for NLP (Natural Language Processing), along with more than 50 corpora and lexical resources, and supports text processing tasks such as classification, tokenization, stemming, and tagging. Some of these resources are Punkt Tokenizer Models, Web Text Corpus, WordNet, and SentiWordNet. You can browse all of them on the official NLTK Corpora Data page.
Steps to Read and Analyze the Sample Text
Step 1: Import the necessary libraries
In this step, I will use the Python standard os module and the NLTK library. You can install NLTK with pip install nltk. On systems where pip belongs to Python 2, use pip3 install nltk for Python 3.x.
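import os
import nltk

# Punkt models are used for sentence splitting.
# On recent NLTK releases you may need nltk.download("punkt_tab") instead.
nltk.download("punkt")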
Step 2: Define the filename and path
I am coding in a Jupyter notebook, so I could access the file path directly. The better practice, however, is to let the os module determine the current working directory. This is especially helpful if you work on projects in PyCharm.
PATH = os.getcwd()
FILE_NAME = "sample.txt"
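Before reading the file, it can help to confirm that it really is in the working directory. The following is a minimal sketch using only the standard library; the full_path variable is introduced here for illustration and is not used elsewhere in the tutorial.

```python
import os

PATH = os.getcwd()          # directory the script or notebook runs from
FILE_NAME = "sample.txt"    # same file name as in the tutorial

# Join directory and file name in a way that works on every OS.
full_path = os.path.join(PATH, FILE_NAME)

if os.path.isfile(full_path):
    print("Found:", full_path)
else:
    print("Missing:", full_path, "(create it or adjust PATH before continuing)")
```

If the file is missing, the corpus reader in the next step has nothing to read, so checking early makes the cause obvious.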
Step 3: Read the text
You will use the NLTK PlaintextCorpusReader and pass the directory path and the file name of the sample text to PlaintextCorpusReader(). I assign the result to a separate variable, corpus.
from nltk.corpus.reader.plaintext import PlaintextCorpusReader

corpus = PlaintextCorpusReader(PATH, FILE_NAME)
Step 4: Explore the corpus
In this step, you can explore the corpus text. For example, to print the raw corpus data you can use corpus.raw().
To list the files, you can use corpus.fileids(). It returns all the file names in the corpus as a list.
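Here is a minimal sketch of printing the raw text and listing the file names, assuming this is done with the reader's raw() and fileids() methods. To stay self-contained it writes a small sample.txt into a temporary directory; the sample sentences are made up for illustration.

```python
import os
import tempfile

from nltk.corpus.reader.plaintext import PlaintextCorpusReader

# Create a throwaway directory containing one small document.
tmp_dir = tempfile.mkdtemp()
with open(os.path.join(tmp_dir, "sample.txt"), "w", encoding="utf-8") as f:
    f.write("NLTK reads plain text.\n\nThis is a second paragraph.\n")

corpus = PlaintextCorpusReader(tmp_dir, "sample.txt")

file_ids = corpus.fileids()   # list of file names in the corpus
raw_text = corpus.raw()       # the unprocessed text as one string

print("Files:", file_ids)
print(raw_text)
```

Neither call needs any downloaded NLTK data, because no tokenization happens at this point.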
Extracting Paragraphs and the Number of Paragraphs
# extract paragraphs from the corpus
paragraphs = corpus.paras()
print("Total Paragraphs:", len(paragraphs))
The corpus.paras() method finds all the paragraphs in a corpus. To get the total number of paragraphs, pass the result to the len() function.
# extract sentences from the corpus
sentences = corpus.sents()
print("Total Sentences:", len(sentences))
print("First sentence:", sentences[0])
To find the total number of sentences in a corpus, first call the sents() method and then pass the result to the len() function. Indexing the result with [0] gives the first sentence as a list of words.
# extract words from the corpus
print("Words in the corpus:", corpus.words())
Using corpus.words(), you get the list of all the words in a corpus. Punctuation marks appear in it as separate tokens.
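The three methods return nested structures: a paragraph is a list of sentences, and a sentence is a list of words. The sketch below shows that nesting on a made-up two-paragraph file. So that it runs without downloading the Punkt models, it passes NLTK's LineTokenizer as a stand-in sentence splitter (one sentence per line). That choice is an assumption of this example, not what the tutorial code above does.

```python
import os
import tempfile

from nltk.corpus.reader.plaintext import PlaintextCorpusReader
from nltk.tokenize import LineTokenizer

# Two paragraphs separated by a blank line, one sentence per line.
text = (
    "NLTK reads plain text.\n"
    "It splits text into words.\n"
    "\n"
    "This is a second paragraph.\n"
)
tmp_dir = tempfile.mkdtemp()
with open(os.path.join(tmp_dir, "sample.txt"), "w", encoding="utf-8") as f:
    f.write(text)

# Stand-in sentence splitter (assumption): treat every line as a sentence.
corpus = PlaintextCorpusReader(
    tmp_dir, "sample.txt", sent_tokenizer=LineTokenizer()
)

paragraphs = corpus.paras()   # list of paragraphs -> sentences -> words
sentences = corpus.sents()    # list of sentences -> words
words = corpus.words()        # flat list of tokens

print("Paragraphs:", len(paragraphs))
print("Sentences:", len(sentences))
print("Words:", len(words))
print("First sentence of first paragraph:", paragraphs[0][0])
```

With the default Punkt tokenizer the counts for real text can differ, since Punkt splits on sentence boundaries rather than on line breaks.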
Step 5: Analyze the corpus
freq = nltk.FreqDist(corpus.words())

# common words
print("Common Words:", freq.most_common(10))

# specific word
print("Specific Word:", freq.get("project"))
NLTK has many functions for frequency distribution analysis of the data, and you can find more of them in the NLTK documentation. In this example, I find the 10 most common words with freq.most_common(10) and the number of occurrences of a specific word with freq.get("project").
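A small sketch of how FreqDist behaves, using a made-up token list instead of a corpus file. One detail worth knowing is that freq.get(word) returns None for a word that never occurs, while freq[word] returns 0.

```python
import nltk

# Made-up tokens standing in for corpus.words().
tokens = "the project plan describes the project . every project needs a plan .".split()

freq = nltk.FreqDist(tokens)

most_common_word = freq.most_common(1)   # list of (token, count) pairs
project_count = freq.get("project")      # count of a token that occurs
missing_get = freq.get("python")         # None: the token never occurs
missing_index = freq["python"]           # 0: indexing falls back to zero
total_tokens = freq.N()                  # total number of counted tokens

print("Most common:", most_common_word)
print("project:", project_count)
print("python via get():", missing_get, "| via []:", missing_index)
print("Total tokens:", total_tokens)
print("Relative frequency of 'project':", freq.freq("project"))
```

Indexing with freq[word] is often the safer choice when the result feeds into arithmetic, because it never yields None.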
All the above steps only read and analyze a single document in a corpus. You can use any number of similar documents. In the next tutorial, you will learn about cleaning and extracting text: removing stop words, tokenization, stemming, and lemmatization. So stay tuned.
Further Reading for Introduction to NLTK:
- Advanced Text Processing using NLTK: The Complete Guide
In the meantime, you can contact us if you have any suggestions or queries.
Data Science Learner Team