
How can I read a PDF in Python? [duplicate]

How can I read a PDF in Python? I know one way is to convert it to text, but I want to read the content directly from the PDF.

Can anyone explain which module in Python is best for PDF extraction?

sg1994 asked Aug 21 '17 10:08



People also ask

How do I read data from a PDF in Python?

Use the PyPDF2 module. Open the PDF document in binary read mode with open('document_path.pdf', 'rb'), then create a reader object with PdfFileReader(). The getPage() and extractText() methods extract text from individual pages. These are the PyPDF2 1.x names; PyPDF2 3.0 removed them in favor of PdfReader, reader.pages[i], and extract_text().

How do I open a PDF file in Python?

Use the PyPDF2 module. PyPDF2 is a Python module for extracting information from a PDF document, merging and splitting documents, cropping pages, encrypting or decrypting a PDF file, and more. Open the PDF document in binary read mode with open('document_path.pdf', 'rb').
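Binary mode matters because a PDF is not a plain-text file: it normally starts with the bytes %PDF- followed by a version number. A minimal standard-library sketch that checks this header before handing the file to a parser (the function name looks_like_pdf is made up for illustration):

```python
def looks_like_pdf(path):
    """Return True if the file at `path` starts with the PDF magic bytes."""
    # 'rb' yields raw bytes; text mode would try to decode them and may fail.
    with open(path, "rb") as handle:
        return handle.read(5) == b"%PDF-"
```

Some real-world files carry a few stray bytes before the header, so treat this as a quick sanity check rather than a validator.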

How to read scanned-in PDF files with Python?

pdfminer (specifically pdfminer.six, a more up-to-date fork of pdfminer) is an effective package when the PDF contains typed text that you can select and highlight. For scanned-in PDF files, where each page is an image, the pytesseract package comes in handy because it runs OCR on the page images.
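A rough sketch of that split, assuming pdfminer.six, pytesseract, and pdf2image are installed. pdf2image is not mentioned above; it is an assumption here for rendering pages to images, and it needs Poppler, just as pytesseract needs the Tesseract binary. The function name and the min_chars threshold are hypothetical:

```python
def read_pdf_text(path, min_chars=20):
    """Return text from a PDF, falling back to OCR when little text is found."""
    from pdfminer.high_level import extract_text

    text = extract_text(path)
    # min_chars is an arbitrary, hypothetical cutoff: a near-empty result
    # suggests the pages are scanned images without a text layer.
    if len(text.strip()) >= min_chars:
        return text

    # Imported here so the OCR packages are only required when the fallback runs.
    from pdf2image import convert_from_path  # assumption: renders pages to PIL images
    import pytesseract

    images = convert_from_path(path)
    return "\n".join(pytesseract.image_to_string(image) for image in images)
```

Calling read_pdf_text('example.pdf') would then return a single string either way; OCR output is usually noisier than an embedded text layer.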

How to extract text from a PDF file in Python?

pdfplumber is a Python module for reading a PDF document and extracting its text, along with other content such as tables. It offers more extraction features than PyPDF2.
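For comparison, a minimal sketch of text extraction with pdfplumber, assuming it is installed (pip install pdfplumber). The function name is made up for illustration:

```python
def pdfplumber_text(path):
    """Return the text of every page, joined with newlines."""
    # Imported inside the function so a module containing this sketch
    # still loads when pdfplumber is not installed.
    import pdfplumber

    with pdfplumber.open(path) as pdf:
        # extract_text() may return None or an empty string for pages
        # without a text layer, so fall back to "".
        return "\n".join(page.extract_text() or "" for page in pdf.pages)
```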

What is the best Python library for reading a PDF file?

I'm using pdfminer and it is an excellent lib, especially if you're comfortable programming in Python. It reads the PDF, extracts every character, and provides each one's bounding box as a tuple (x0, y0, x1, y1). pdfminer also extracts rectangles, lines, and some images, and tries to detect words.
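A small sketch of reading those bounding boxes, assuming pdfminer.six is installed. It works at the level of text blocks rather than single characters, and the function name text_boxes is hypothetical:

```python
def text_boxes(path):
    """Return (page_number, bbox, text) for each text block in the PDF."""
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    boxes = []
    for page_number, layout in enumerate(extract_pages(path), start=1):
        for element in layout:
            # Skip non-text layout objects such as lines, rectangles, and figures.
            if isinstance(element, LTTextContainer):
                # element.bbox is the (x0, y0, x1, y1) tuple in PDF points.
                boxes.append((page_number, element.bbox, element.get_text().strip()))
    return boxes
```

Coordinates are measured from the bottom-left corner of the page, so y grows upward.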


2 Answers

You can use the PyPDF2 package.

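# install PyPDF2 (run this in a shell, not in Python)
pip install PyPDF2

# import the required module
import PyPDF2

# open the PDF in binary read mode; the with-block closes it afterwards
with open('example.pdf', 'rb') as file:
    # create a PDF reader object
    # (PyPDF2 1.x called this PdfFileReader; PyPDF2 3.0 removed that name)
    fileReader = PyPDF2.PdfReader(file)

    # print the number of pages in the PDF file
    # (the old fileReader.numPages attribute was removed in PyPDF2 3.0)
    print(len(fileReader.pages))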

See the documentation: https://pypdf2.readthedocs.io/en/latest/ (older copy: http://pythonhosted.org/PyPDF2/)
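Since the question is about reading the content and not just counting pages, here is a minimal sketch of per-page text extraction. It assumes a recent PyPDF2 release that provides PdfReader and extract_text(); the function name extract_page_texts is made up for illustration:

```python
def extract_page_texts(path):
    """Return a list holding the extracted text of each page in the PDF."""
    # Imported inside the function so a module containing this sketch
    # still loads when PyPDF2 is not installed.
    from PyPDF2 import PdfReader

    with open(path, "rb") as handle:
        reader = PdfReader(handle)
        # Pages without a text layer (for example scans) yield no text,
        # so fall back to an empty string.
        return [page.extract_text() or "" for page in reader.pages]
```

For example, extract_page_texts('example.pdf')[0] would give the text of the first page. Extraction quality depends on how the PDF was produced, so layout and spacing may not match the original.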

shankarj67 answered Oct 12 '22 21:10



You can use the textract module in Python.

To install:

pip install textract

To read a PDF:

import textract

# process() returns bytes, so decode it to get a str
text = textract.process('path/to/pdf/file', method='pdfminer').decode('utf-8')

For details, see the Textract documentation.

Kallz answered Oct 12 '22 21:10
