Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to extract text from a PDF file?

Tags:

python

pdf

People also ask

How can I extract text from PDF?

Once you've opened the file, click on the "Edit" tab, and then click on the "edit" icon. Now you can right-click on the text and select "Copy" to extract the text you need.

How can I extract text from a PDF image?

You can capture text from a scanned image, upload your image file from your computer, or take a screenshot on your desktop. Then simply right click on the image, and select Grab Text. The text from your scanned PDF can then be copied and pasted into other programs and applications.


I was looking for a simple solution to use for python 3.x and windows. There doesn't seem to be support from textract, which is unfortunate, but if you are looking for a simple solution for windows/python 3 checkout the tika package, really straight forward for reading pdfs.

Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.

from tika import parser # pip install tika

raw = parser.from_file('sample.pdf')
print(raw['content'])

Note that Tika is written in Java so you will need a Java runtime installed


Use textract.

  • http://textract.readthedocs.io/en/latest/
  • https://github.com/deanmalmgren/textract

It supports many types of files including PDFs

import textract
text = textract.process("path/to/file.extension")

I recommend to use pymupdf or pdfminer.six.

Those packages are not maintained:

  • PyPDF2, PyPDF3, PyPDF4
  • pdfminer (without .six)

How to read pure text with pymupdf

There are different options which will give different results, but the most basic one is:

import fitz  # this is pymupdf

with fitz.open("my.pdf") as doc:
    text = ""
    for page in doc:
        text += page.getText()

print(text)

Other PDF libraries

  • pikepdf does not support text extraction (source)

Look at this code:

import PyPDF2
pdf_file = open('sample.pdf', 'rb')
read_pdf = PyPDF2.PdfFileReader(pdf_file)
number_of_pages = read_pdf.getNumPages()
page = read_pdf.getPage(0)
page_content = page.extractText()
print page_content.encode('utf-8')

The output is:

!"#$%#$%&%$&'()*%+,-%./01'*23%4
5'%1$#26%3/%7/))/8%&)/26%8#3"%3"*%313/9#&)
%

Using the same code to read a pdf from 201308FCR.pdf .The output is normal.

Its documentation explains why:

def extractText(self):
    """
    Locate all text drawing commands, in the order they are provided in the
    content stream, and extract the text.  This works well for some PDF
    files, but poorly for others, depending on the generator used.  This will
    be refined in the future.  Do not rely on the order of text coming out of
    this function, as it will change if this function is made more
    sophisticated.
    :return: a unicode string object.
    """