How to Extract Text from a PDF File in Python

Suppose you have a PDF file and want to extract all of its text. How can you do that? In this tutorial, you will learn how to use the PyPDF2 library to extract all the text from a PDF file, or only the text of the pages that contain a specific string.

Extract Text from a PDF File in Python: Implementation

Let's implement a function that extracts text from a PDF file.

In this method, you will write code that extracts all the text in the PDF file. The function, named extract_text_from_pdf, takes the path to the PDF file and returns the extracted text as a single string.

Below are the lines of code for the function.

import PyPDF2

def extract_text_from_pdf(pdf_path):
    text = ""
    # PDF files must be opened in binary mode.
    with open(pdf_path, "rb") as file:
        reader = PyPDF2.PdfReader(file)
        num_pages = len(reader.pages)

        # Append the text of each page in order.
        for page_number in range(num_pages):
            page = reader.pages[page_number]
            text += page.extract_text()

    return text

Explanation of the code.

The text variable starts as an empty string. The open() function opens the PDF file in binary mode, and the file object is passed to PyPDF2's reader class (PdfReader in current releases; releases before 3.0 also accepted the older name PdfFileReader). The code first gets the number of pages. It then uses a for loop to visit each page and appends that page's text using the page's text-extraction method (extract_text() in current releases, formerly extractText()).

Below is the full code for the above method.

import PyPDF2

def extract_text_from_pdf(pdf_path):
    text = ""
    with open(pdf_path, "rb") as file:
        reader = PyPDF2.PdfReader(file)
        num_pages = len(reader.pages)

        for page_number in range(num_pages):
            page = reader.pages[page_number]
            text += page.extract_text()

    return text

if __name__ == "__main__":
    pdf_file_path = "files/file1.pdf"
    extracted_text = extract_text_from_pdf(pdf_file_path)
    print(extracted_text)

Output

[Figure: Extract text from PDF output]
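
One thing to watch for in the function above is that text += joins pages with nothing between them, so the last word of one page can run into the first word of the next. The sketch below is one possible variation, not part of the original code: it collects each page's text in a list and joins the pieces with a newline. The helper names read_page_texts and join_page_texts are made up for this illustration, and the PyPDF2 import sits inside the function only so that the joining helper can be tried without the library installed.

```python
def read_page_texts(pdf_path):
    """Return a list with the extracted text of every page, in order."""
    # Imported here (an illustrative choice) so join_page_texts below
    # can be used on plain strings without PyPDF2 installed.
    import PyPDF2

    with open(pdf_path, "rb") as file:
        reader = PyPDF2.PdfReader(file)
        # reader.pages can be iterated directly, no index needed.
        return [page.extract_text() for page in reader.pages]


def join_page_texts(page_texts, separator="\n"):
    """Join page texts with a separator, skipping empty or missing pages."""
    return separator.join(text for text in page_texts if text)


def extract_text_from_pdf(pdf_path):
    """Same idea as the function above, but with a newline between pages."""
    return join_page_texts(read_page_texts(pdf_path))
```

Calling extract_text_from_pdf("files/file1.pdf") would then return the page texts separated by newlines. Pages for which no text comes back (which can happen with scanned images) are simply skipped.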

 

Search for Specific Text in a PDF File

You can also search for specific text in a PDF file. If the search text is found on a page, the full text of that page is added to the result; pages without a match are skipped. If no page matches, the function returns an empty string.

Below is the complete code.

import PyPDF2

def extract_specific_text_from_pdf(pdf_path, search_text):
    extracted_text = ""
    with open(pdf_path, "rb") as file:
        reader = PyPDF2.PdfReader(file)
        num_pages = len(reader.pages)

        for page_number in range(num_pages):
            page = reader.pages[page_number]
            page_text = page.extract_text()

            if search_text in page_text:
                extracted_text += page_text

    return extracted_text


if __name__ == "__main__":
    pdf_file_path = "files/file1.pdf"
    target_text = "data"
    extracted_text = extract_specific_text_from_pdf(pdf_file_path, target_text)
    print(extracted_text)

Most of the code inside the function is the same as before. The only difference is that a page's text is added to the result only when that page contains the search text.
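
The in check in the function above is case-sensitive, so searching for "data" would not match a page that only contains "Data". As a rough sketch of one way to relax that, the hypothetical helper below compares lowercase copies of the strings and also reports which pages matched. It works on a list of page texts, such as one built by calling extract_text() on each page as in the code above. The function name and the 1-based page numbering are choices made for this example.

```python
def find_pages_containing(page_texts, search_text):
    """Return (page_number, text) pairs for pages that contain search_text.

    The comparison ignores case, and page numbers start at 1.
    """
    needle = search_text.lower()
    matches = []
    for page_number, page_text in enumerate(page_texts, start=1):
        # Treat a missing page text as an empty string.
        page_text = page_text or ""
        if needle in page_text.lower():
            matches.append((page_number, page_text))
    return matches


# Example with plain strings standing in for extracted page texts.
sample_pages = ["Intro to Data Science", "NumPy arrays", "More data here"]
for number, text in find_pages_containing(sample_pages, "data"):
    print(number, text)
```

With the sample strings above, the loop prints pages 1 and 3, because both contain "data" once case is ignored.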

Conclusion

These are two ways to extract text from a PDF file. If you have more than one PDF file and want the text from all of them, you can merge the files first and then use the method above, or call the function on each file in turn.
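
For several files, an alternative to merging is to loop over the paths and collect the text per file. The following is a minimal sketch, assuming an extraction function like extract_text_from_pdf from earlier in this tutorial. The function name extract_text_from_files and its extract parameter are hypothetical names introduced here, and the PyPDF2 import is placed inside the function only so the loop can be tried without the library installed.

```python
def extract_text_from_pdf(pdf_path):
    """Extract all text from one PDF, as in the tutorial's full code."""
    import PyPDF2  # imported lazily so the loop below can be tested alone

    text = ""
    with open(pdf_path, "rb") as file:
        reader = PyPDF2.PdfReader(file)
        for page in reader.pages:
            text += page.extract_text()
    return text


def extract_text_from_files(pdf_paths, extract=extract_text_from_pdf):
    """Return a dict that maps each path to the text extracted from it."""
    results = {}
    for pdf_path in pdf_paths:
        results[pdf_path] = extract(pdf_path)
    return results
```

For example, extract_text_from_files(["files/file1.pdf", "files/file2.pdf"]) would return a dictionary with one entry per file. The second file name is only a placeholder. Keeping the results separate per file also makes it easy to tell which file a piece of text came from.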

I hope you liked this tutorial. If you have any questions, you can contact us for more help.

Meet Sukesh ( Chief Editor ), a passionate and skilled Python programmer with a deep fascination for data science, NumPy, and Pandas. His journey in the world of coding began as a curious explorer and has evolved into a seasoned data enthusiast.
 