How to Create a WordCloud? Display in 3 Steps using NLTK

You have probably come across the word cloud in text analytics. It is a graphical display of the words present in a corpus, where the size of each word depends on how often it occurs: the more times a word appears in the corpus, the larger it is drawn. You can use a word cloud to show the most popular words in the corpus. In this how-to tutorial, you will learn how to create a wordcloud from a corpus.
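To make the size rule concrete, here is a minimal sketch that uses only the Python standard library. It counts word occurrences and maps each count linearly onto a font size. The helper name font_sizes, the size range, and the simple regex tokenizer are assumptions made for illustration; the wordcloud package has its own tokenization and layout logic, which this sketch does not reproduce.

```python
import re
from collections import Counter


def font_sizes(text, min_size=10, max_size=60):
    """Map each word to a font size proportional to its frequency.

    Hypothetical helper for illustration only; not part of the wordcloud package.
    """
    # Lowercase the text and pull out alphabetic tokens.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    if not counts:
        return {}
    top = max(counts.values())
    # The most frequent word gets max_size; the others scale down linearly,
    # but never below min_size.
    return {
        word: max(min_size, round(max_size * count / top))
        for word, count in counts.items()
    }


sizes = font_sizes("data science data analysis data")
print(sizes)
```

Here "data" occurs three times and the other words once each, so "data" receives the largest size.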

Before getting to the coding part, let's look at some use cases of the wordcloud.

Popular Hashtags
Most Popular Player Names
Presenting Qualitative Survey Data
Analysis of SEO Keywords
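
As an illustration of the hashtag use case, the sketch below extracts hashtags from a few short posts and counts them. The helper name hashtag_counts and the sample posts are made up for this example, and the rule used for what counts as a hashtag (a # followed by letters, digits, or underscores) is an assumption.

```python
import re
from collections import Counter


def hashtag_counts(posts):
    """Count hashtags across a list of post strings (hypothetical helper)."""
    counts = Counter()
    for post in posts:
        # A hashtag here is '#' followed by letters, digits, or underscores.
        tags = re.findall(r"#(\w+)", post)
        # Lowercase so that #Python and #python are counted together.
        counts.update(tag.lower() for tag in tags)
    return counts


posts = [
    "Learning #Python and #DataScience today",
    "More #python tips",
]
counts = hashtag_counts(posts)
print(counts.most_common(2))
```

A frequency dictionary like this can be passed to the generate_from_frequencies() method of WordCloud instead of calling generate() on raw text.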

Step-by-Step Guide to Create a Word Cloud

Step 1: Load the text corpus

First, import the text data from which you want to create the wordcloud. Here I am reading the text from a file in the current directory for learning purposes. However, you can also read the text from a URL. I am using the Python os module to build the path of the text file. You can download the text file from the GitHub URL.

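import os
# Build the full path to the corpus file in the current working directory
path = os.path.join(os.getcwd(), "sample_corpus.txt")
with open(path, "r") as fh:
    file_data = fh.read()

# Preview the first 200 characters (displayed automatically in a notebook)
file_data[0:200]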

os.getcwd() returns the current working directory, and os.path.join() builds the full path to sample_corpus.txt. The file is then opened from that path, and its contents are read into memory as the string file_data.
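
If the corpus lives at a URL rather than on disk, a sketch along the following lines could be used instead. The helper name read_text_from_url and the example.com address are placeholders, not the actual location of the sample corpus, and UTF-8 encoding is assumed.

```python
from urllib.request import urlopen


def read_text_from_url(url, encoding="utf-8"):
    """Download a text resource and return it as a string (hypothetical helper)."""
    # urlopen returns a file-like response; the timeout avoids waiting forever.
    with urlopen(url, timeout=10) as response:
        return response.read().decode(encoding)


# Example call (placeholder URL, replace it with the real location of your corpus):
# file_data = read_text_from_url("https://example.com/sample_corpus.txt")
```

The returned string can be used in place of file_data in the following steps.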

Step 2: Import the WordCloud and stopwords

In this step, you will import WordCloud and STOPWORDS from the wordcloud module. After that, you will call the WordCloud() constructor and pass it the arguments stopwords, max_words, and background_color. In this example I am using max_words=25 and background_color="white". Finally, call generate() on the text in file_data to build the wordcloud.

from wordcloud import WordCloud,STOPWORDS
#create stop words 
stopwords = set(STOPWORDS)
#call the wordcloud Constructor 
WC = WordCloud(stopwords=stopwords,max_words=25,background_color="white").generate(file_data)
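
To get a feel for what the stopwords and max_words arguments control, here is a rough standard-library sketch of the same idea: drop the stopwords, then keep only the most frequent remaining words. The helper name top_words and the sample sentence are assumptions for illustration; this is a conceptual approximation, not the internal implementation of the wordcloud package.

```python
import re
from collections import Counter


def top_words(text, stopwords, max_words):
    """Return the max_words most frequent non-stopword tokens (hypothetical helper)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    # Drop every token that appears in the stopword set.
    kept = [token for token in tokens if token not in stopwords]
    # most_common() sorts by count, highest first.
    return Counter(kept).most_common(max_words)


sample = "the cloud shows the words and the cloud sizes the words"
result = top_words(sample, stopwords={"the", "and"}, max_words=2)
print(result)
```

With the stopwords removed, only "cloud" and "words" (two occurrences each) survive the max_words=2 cut.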

Step 3: Display the Wordcloud

You can use any Python visualization package to display the wordcloud. Here I am using the popular, open-source matplotlib.

import matplotlib.pyplot as plt
from matplotlib import rcParams
rcParams["figure.figsize"] = (10,5)
plt.imshow(WC)
plt.axis("off")
plt.show()

Here you import matplotlib.pyplot as plt, and rcParams["figure.figsize"] sets the figure size to 10 by 5 inches. The statement plt.imshow() displays the image on the axes. The figure does not look good when the wordcloud is shown with axis ticks and labels, so I added the statement plt.axis("off").

Enhancing the Wordcloud Display

Sometimes the wordcloud contains words that you do not want to include. To remove them, update the stopwords. For example, I don't want the words article, will, becomes, word, and document to appear in the wordcloud, so I add them to the stopwords. The following is the full code for enhancing the wordcloud.

stopwords.update(["article","will","becomes","word","document"])
WC = WordCloud(stopwords=stopwords,max_words=25,background_color="white").generate(file_data)
plt.imshow(WC)
plt.axis("off")
plt.show()

End Notes

A wordcloud is very useful for visualizing text data. It shows you which words are most important and where the emphasis lies. You can use it in many applications, for example to see which words a given statement focuses on and to gauge their importance. Suppose I want to read the monetary policies of a country. I can take the text data and find the areas the banks focus on most, such as employment, inflation, and interest rates. There are other use cases as well, which you can search for on the internet.

We hope you liked this wordcloud tutorial. If you have any queries, you can contact or message us on the official Data Science Learner Facebook Page.