How to Extract Text from a PDF File Using Python: 4 Steps Only

PDF files contain unstructured data, and turning that data into something structured and meaningful is a challenging task. They often hold useful information that can feed a predictive or NLP model. Several Python libraries let you work with PDF files, for example to extract text, tables, and images, and the extracted text can then be used for text analysis. In this tutorial, you will learn how to extract text from a PDF file using Python.

Step-by-Step Guide to Extract Text

Step 1: Import the necessary libraries

Many libraries can extract text from a PDF file. For this demonstration, I am using PyPDF2.

import PyPDF2

Step 2: Open the PDF File

Open the PDF file with Python's built-in open() function in 'rb' (read binary) mode. PyPDF2 reads from this file object in the next step.

# open the PDF file in binary mode
pdf_file = open('data/FOMC_report.pdf', 'rb')
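A minimal sketch of the same step using a with block, so the file is closed automatically. The helper name looks_like_pdf and the throwaway sample file are assumptions for illustration and not part of the original tutorial; the check relies on well-formed PDF files starting with the bytes %PDF-.

```python
import os
import tempfile


def looks_like_pdf(path):
    """Return True if the file starts with the PDF header bytes b'%PDF-'."""
    # 'rb' yields raw bytes; text mode would try to decode the binary data.
    with open(path, "rb") as pdf_file:  # closed automatically on exit
        return pdf_file.read(5) == b"%PDF-"


# Hypothetical demo: write a throwaway file that only carries a PDF header.
with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
    tmp.write(b"%PDF-1.7\n")
    sample_path = tmp.name

print(looks_like_pdf(sample_path))
os.remove(sample_path)
```

Opening the file in a with block avoids leaving the handle open if a later step raises an error.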

Step 3: Read PDF and Check for Encryption

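After opening the file, read it by creating a PyPDF2.PdfFileReader object, then check for encryption with the getIsEncrypted() method. This check matters because text cannot be extracted from an encrypted PDF until it has been decrypted. These names belong to the older PyPDF2 API; PyPDF2 3.0 removed them in favor of PdfReader, is_encrypted, and len(reader.pages). Use the code below.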

# read the PDF
read_pdf = PyPDF2.PdfFileReader(pdf_file)

# check whether the PDF is encrypted
read_pdf.getIsEncrypted()

# number of pages
read_pdf.numPages
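The same checks can be sketched against the newer snake_case API (PdfReader, is_encrypted, decrypt(), pages) that replaced PdfFileReader in PyPDF2 3.0. The helper name page_count and its password parameter are assumptions for illustration. The function takes an already constructed reader, so it does not depend on how the file was opened.

```python
def page_count(reader, password=None):
    """Return the number of pages, decrypting the reader first if needed.

    `reader` is assumed to be a PyPDF2 PdfReader (3.x API).
    """
    if reader.is_encrypted:
        if password is None:
            raise ValueError("PDF is encrypted; supply a password")
        # decrypt() unlocks the reader in place when the password matches.
        reader.decrypt(password)
    # reader.pages behaves like a sequence of page objects.
    return len(reader.pages)


# Usage sketch (assumes PyPDF2 3.x is installed and the file exists):
#   from PyPDF2 import PdfReader
#   reader = PdfReader("data/FOMC_report.pdf")
#   print(page_count(reader))
```

Raising an error early for a missing password makes the failure easier to trace than an extraction error several steps later.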

Step 4: Extract the text

Once you know the number of pages, you can extract text with the getPage() and extractText() methods. getPage() returns the page object at a zero-based index, and extractText() returns the text of that page as a string. For example, to extract text from page 1, pass index 0, as in the following code.

# extract text from the first page (page indexes start at 0)
page1 = read_pdf.getPage(0)
page1.extractText()
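Because page indexes start at 0 while people count pages from 1, a small wrapper can make the conversion explicit. This is a sketch under stated assumptions: the helper name page_text is hypothetical, and it uses the newer PyPDF2 names (reader.pages and extract_text()) rather than the getPage() and extractText() calls shown above.

```python
def page_text(reader, page_number):
    """Return the text of a page given its 1-based page number.

    `reader` is assumed to be a PyPDF2 PdfReader (3.x API).
    """
    total = len(reader.pages)
    if not 1 <= page_number <= total:
        raise IndexError(f"page_number must be between 1 and {total}")
    # Convert the human-friendly page number to a zero-based index.
    text = reader.pages[page_number - 1].extract_text()
    # Fall back to an empty string if no text was returned.
    return text or ""


# Usage sketch (assumes PyPDF2 3.x is installed and the file exists):
#   from PyPDF2 import PdfReader
#   reader = PdfReader("data/FOMC_report.pdf")
#   print(page_text(reader, 1))
```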

Before Splitting

In the output, each line break appears as \n. You can split the text into lines with the split('\n') method, which converts the extracted string into a list of lines.

After Splitting
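A minimal sketch of the split step on a made-up string. The sample text and the helper name to_lines are assumptions for illustration and are not taken from the report; the helper also strips surrounding whitespace and drops empty lines, which goes slightly beyond a bare split('\n').

```python
def to_lines(text):
    """Split extracted text on newlines, dropping blank lines."""
    lines = []
    for line in text.split("\n"):
        cleaned = line.strip()
        if cleaned:  # skip lines that are empty after stripping
            lines.append(cleaned)
    return lines


# Hypothetical extracted text, standing in for the output of extractText().
sample = "First line\n Second line \n\nThird line"
print(to_lines(sample))
# prints ['First line', 'Second line', 'Third line']
```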

Conclusion

Converting unstructured text from a PDF into structured data is useful if you want to apply Natural Language Processing (NLP). After extracting the text, you can move on to steps such as text preprocessing, word n-grams, etc.

I hope this post has answered your question about how to extract text from a PDF file using Python. Please contact us if you have any questions. We are always ready to help you.