I am using Python 3.4 and need to extract all the text from a PDF and then use it for text processing.
All the answers I have seen suggest options for Python 2.7.
I need something in Python 3.4.
Bonson
Several Python libraries can extract data from PDFs. For example, the PyPDF2 library extracts text that is laid out sequentially, such as lines or form fields, and the Camelot library extracts tables.
To work with PDFs in Python 3.4, install the PyPDF2 module. PyPDF2 cannot extract images, charts, or other media, but it can extract text and return it as a Python string. To install it, run the following from the command line:
pip install PyPDF2
The module name is case-sensitive, so type the 'y' in lowercase and all other letters in uppercase.
>>> import PyPDF2
>>> pdfFileObj = open('my_file.pdf', 'rb')  # 'rb' opens the file in binary read mode
>>> pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
>>> pdfReader.numPages
56
>>> pageObj = pdfReader.getPage(9)  # 9 is a zero-based index, so this is the tenth page
>>> pageObj.extractText()
The last statement returns, as a string, all the text available on the page at index 9 of 'my_file.pdf'. Page indices start at 0, so this is the tenth page of the document.
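Since the question asks for all the text rather than a single page, the same calls can be wrapped in a loop over every page. The following is a minimal sketch, not a definitive implementation: the function names extract_all_text and read_pdf_text are hypothetical, and it assumes a PyPDF2 version that still provides PdfFileReader, numPages, getPage, and extractText as used in the session above (later releases renamed these calls).

```python
def extract_all_text(reader):
    """Join the text of every page of a PdfFileReader-like object.

    `reader` only needs a `numPages` attribute and a `getPage(index)`
    method whose result has `extractText()`.
    """
    pages = []
    for index in range(reader.numPages):  # page indices start at 0
        page = reader.getPage(index)
        pages.append(page.extractText())
    return "\n".join(pages)


def read_pdf_text(path):
    """Open the PDF at `path` and return the text of all its pages."""
    # Imported here so extract_all_text can be used without PyPDF2 installed.
    import PyPDF2

    # The file must stay open while pages are read, hence the with block.
    with open(path, "rb") as pdf_file:
        reader = PyPDF2.PdfFileReader(pdf_file)
        return extract_all_text(reader)
```

Calling read_pdf_text('my_file.pdf') would then return one string that can be passed on to the text-processing step. How clean that string is depends on how the PDF was produced, so some whitespace cleanup may still be needed.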