
Best tool for text extraction from PDF in Python 3.4 [closed]

Tags:

python-3.x

pdf

I am using Python 3.4 and need to extract all the text from a PDF and then use it for text processing.

All the answers I have seen suggest options for Python 2.7.

I need something in Python 3.4.

Bonson asked Sep 19 '15 11:09


People also ask

Can I extract data from PDF using Python?

Several Python libraries can extract data from PDFs. For example, PyPDF2 can extract text from PDFs in which the text is laid out sequentially, such as in lines or forms, and the Camelot library can extract tables.


1 Answer

You need to install the PyPDF2 module to work with PDFs in Python 3.4. PyPDF2 cannot extract images, charts, or other media, but it can extract text and return it as a Python string. To install it, run pip install PyPDF2 from the command line. The module name is case-sensitive when you import it, so type the 'y' in lowercase and every other letter in uppercase.
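To confirm that the install worked and that the import name is spelled with the right case, a quick check along these lines may help. It is only a sketch: `is_importable` is a hypothetical helper name, and it relies on `importlib.util.find_spec`, which is available from Python 3.4 onwards.

```python
import importlib.util


def is_importable(module_name):
    """Return True if a top-level module with exactly this name can be found."""
    return importlib.util.find_spec(module_name) is not None


# Compare the correct spelling with an all-lowercase one.
for name in ("PyPDF2", "pypdf2"):
    status = "found" if is_importable(name) else "not found"
    print(name, "->", status)
```

Because Python's import system matches names case-sensitively by default, the lowercase spelling is normally reported as not found even when PyPDF2 is installed.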

>>> import PyPDF2
>>> pdfFileObj = open('my_file.pdf', 'rb')    # 'rb' opens the file in binary read mode
>>> pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
>>> pdfReader.numPages
56
>>> pageObj = pdfReader.getPage(9)    # page indexes start at 0, so 9 is the tenth page
>>> pageObj.extractText()

The last statement returns all the text available on that page of 'my_file.pdf' as a string.
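Since the question asks for all the text rather than a single page, the same calls can be looped over every page. The following is a minimal sketch, not a definitive implementation: `join_pages` and `extract_all_text` are hypothetical helper names, and the code assumes the same PyPDF2 1.x method names used in the session above (`PdfFileReader`, `numPages`, `getPage`, `extractText`). PyPDF2 3.x removed those names in favour of `PdfReader`, `reader.pages` and `page.extract_text()`, so adjust accordingly on a newer release.

```python
def join_pages(page_texts, separator="\n\n"):
    """Join per-page strings into one string, skipping empty pages."""
    cleaned = (text.strip() for text in page_texts if text)
    return separator.join(t for t in cleaned if t)


def extract_all_text(pdf_path):
    """Return the text of every page in the PDF at pdf_path as one string."""
    # Imported here so join_pages can be used without PyPDF2 installed.
    import PyPDF2

    with open(pdf_path, "rb") as pdf_file:  # 'rb' = binary read mode
        reader = PyPDF2.PdfFileReader(pdf_file)
        # Pages are indexed from 0 to numPages - 1; read them while the
        # file is still open, because page content is loaded lazily.
        texts = [reader.getPage(i).extractText()
                 for i in range(reader.numPages)]
    return join_pages(texts)
```

Calling `extract_all_text('my_file.pdf')` would then give a single string ready for further text processing. The `with` block also closes the file, which the interactive session leaves open.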

JohnnyBravo-xyz answered Oct 07 '22 23:10