Extracting text from a PDF file using the pypdf library.
The Python package pypdf can do what we want here (text extraction), although it offers more than we need. It can also be used to generate, decrypt, and merge PDF files. Note: For more information, refer to Working with PDF files in Python.
Installation
To install this package, type the command below in the terminal.
pip install pypdf
Example:
# importing required modules
from pypdf import PdfReader

# creating a pdf reader object
reader = PdfReader('example.pdf')

# printing number of pages in pdf file
print(len(reader.pages))

# getting a specific page from the pdf file
page = reader.pages[0]

# extracting text from page
text = page.extract_text()
print(text)
Let us try to understand the above code in chunks:
reader = PdfReader('example.pdf')
- We create an object of the PdfReader class from the pypdf module.
- The PdfReader constructor takes one required positional argument, here the path to the PDF file.
print(len(reader.pages))
- The pages property gives a list-like sequence of PageObject instances, so we can use Python's built-in len() function to get the number of pages in the PDF file.
page = reader.pages[0]
- Because reader.pages can be indexed like a list of PageObject instances, we can get a specific page of the PDF by its index. Python indexing starts from 0, so reader.pages[0] gives us the first page of the PDF file.
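Because indexing is zero-based while readers usually think in page numbers starting at 1, a small wrapper can make the conversion explicit. The sketch below is only an illustration: get_page is a hypothetical helper name, not part of pypdf, and it assumes only that its argument supports len() and indexing, as reader.pages does.

```python
def get_page(pages, page_number):
    """Return the page for a 1-based page number.

    `get_page` is a hypothetical helper (not a pypdf function). `pages`
    can be any indexable sequence, for example `reader.pages`.
    """
    # Reject page numbers outside 1..len(pages) with a clear message.
    if not 1 <= page_number <= len(pages):
        raise IndexError(
            f"page {page_number} is out of range (1..{len(pages)})"
        )
    # Convert the 1-based page number to a 0-based index.
    return pages[page_number - 1]
```

With the reader from the example above, get_page(reader.pages, 1) would return the same object as reader.pages[0].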
text = page.extract_text()
print(text)
- The page object has an extract_text() method that extracts the text from that PDF page and returns it as a string.
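The example above reads a single page. To collect the text of a whole document, the same extract_text() call can be repeated for every page. The following is a minimal sketch under stated assumptions: extract_all_text and extract_text_from_pdf are hypothetical helper names introduced here for illustration, and 'example.pdf' is the same placeholder file name used earlier.

```python
def extract_all_text(pages, separator="\n"):
    """Join the text of every page in `pages` into one string.

    Hypothetical helper: `pages` is any iterable of objects that have
    an extract_text() method, such as `reader.pages`.
    """
    chunks = []
    for page in pages:
        # Defensive: treat a missing result as empty text.
        chunks.append(page.extract_text() or "")
    return separator.join(chunks)


def extract_text_from_pdf(path):
    """Open the PDF at `path` and return the text of all its pages."""
    # Imported here so extract_all_text stays usable without pypdf.
    from pypdf import PdfReader

    reader = PdfReader(path)
    return extract_all_text(reader.pages)
```

Calling print(extract_text_from_pdf('example.pdf')) would then print the text of every page, with pages separated by a newline. Pages that contain only scanned images typically yield little or no text, since no OCR is performed here.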
Extract text from PDF File using Python
PDF stands for Portable Document Format, and it is one of the most widely used digital document formats. PDF files use the .pdf extension. The format is used to present and exchange documents reliably, independent of software, hardware, or operating system.
In this article, we will extract text from PDF files using two Python libraries, pypdf and PyMuPDF.