In this post, we will learn how to extract text from a PDF file in Python. We can do this in two different ways: with the pdfplumber library and with the PyPDF2 library. We will discuss both methods, each with example code you can run.
Getting The Text From a PDF
There are two different ways to extract the text from a PDF file. We will go through both methods one by one and see how each works and how they differ from each other.
By Using PyPDF2
PyPDF2 is a Python library that can be used to get the text from a given PDF file. Here is the code to do it.
import PyPDF2

# PdfFileReader, numPages and getPage are the names used before PyPDF2 3.0
# open the PDF file in read-binary mode
with open('prakhar.pdf', 'rb') as pdf_file:
    # create a PDF reader object
    pdf_reader = PyPDF2.PdfFileReader(pdf_file)
    # get the number of pages in the PDF file
    num_pages = pdf_reader.numPages
    # loop through each page in the PDF file
    for page in range(num_pages):
        # get the page object
        pdf_page = pdf_reader.getPage(page)
        # extract the text from the page
        page_text = pdf_page.extract_text()
        # print the text from the page
        print(page_text)
In this example, we import the PyPDF2 library and use it to extract text from a PDF file named prakhar.pdf. We open the file in read-binary mode using the with statement, which closes the file automatically when the block ends. The with statement is useful anywhere a resource has to be cleaned up after use.
After that, we create a PDF reader object with PdfFileReader(), a class provided by PyPDF2, and get the number of pages in the PDF file from its numPages attribute.
Next, we loop through each page in the PDF file. For each page, we get the page object using the getPage() function and extract its text using the extract_text() function. Finally, we print the text from the page.
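The names PdfFileReader, numPages and getPage belong to PyPDF2 versions before 3.0. From PyPDF2 3.0 onward the reader class is PdfReader and the pages are available through its pages attribute. Below is a minimal sketch of the same steps with the newer names, assuming PyPDF2 3.x is installed. The helper names join_page_text and read_pdf_text are made up for this illustration and are not part of the library.

```python
def join_page_text(pages):
    """Join the text of every page, skipping pages that yield no text.

    Hypothetical helper: it works with any objects that offer an
    extract_text() method, such as PyPDF2 page objects.
    """
    chunks = []
    for page in pages:
        text = page.extract_text()
        if text:  # skip pages that return None or an empty string
            chunks.append(text)
    return "\n".join(chunks)


def read_pdf_text(path):
    """Return the text of the PDF at `path` (assumes PyPDF2 3.x)."""
    # imported here so join_page_text can be used without PyPDF2 installed
    from PyPDF2 import PdfReader

    # keep the file open while the pages are being read
    with open(path, "rb") as pdf_file:
        reader = PdfReader(pdf_file)
        return join_page_text(reader.pages)
```

Calling print(read_pdf_text('prakhar.pdf')) would then print the text of all pages at once instead of page by page.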
Using pdfplumber Library
The pdfplumber library can do the same job. The example below shows how to use it.
import pdfplumber

# open the PDF file
with pdfplumber.open('prakhar.pdf') as pdf:
    # loop through each page in the PDF file
    for page in pdf.pages:
        # extract the text from the page
        page_text = page.extract_text()
        # print the text from the page
        print(page_text)
Here too we simply open prakhar.pdf and print the text of each page. pdfplumber gives us the pages directly through pdf.pages, so we do not need a page count or a page index.
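If we want to keep the text rather than print it, one option is to collect it into a list with one entry per page. The sketch below is only an illustration: page_texts and read_with_pdfplumber are hypothetical helper names, and the `or ""` guard assumes that extract_text() may return None or an empty string for a page without text, depending on the pdfplumber version.

```python
def page_texts(pdf):
    """Return one string per page; pages without text give ''.

    Hypothetical helper: `pdf` is any object with a `pages` attribute
    whose items offer extract_text(), such as an opened pdfplumber PDF.
    """
    return [(page.extract_text() or "") for page in pdf.pages]


def read_with_pdfplumber(path):
    """Open the PDF at `path` and return its text page by page."""
    # imported here so page_texts can be used without pdfplumber installed
    import pdfplumber

    with pdfplumber.open(path) as pdf:
        return page_texts(pdf)
```

For example, read_with_pdfplumber('prakhar.pdf')[0] would give the text of the first page.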
To learn more about how to extract text from a PDF file, see the GeeksforGeeks article on the topic.