How to use pdfminer.six with Python 3?

I want to use pdfminer.six, a tool for extracting information from PDF documents with Python 3. The problem is that there is no good documentation and no source code example showing how to use it.

I have already tried some code from StackOverflow, but it didn't work. Below is my code.

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO

def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = open(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos=set()

    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,caching=caching, check_extractable=True):
        interpreter.process_page(page)

    text = retstr.getvalue()

    fp.close()
    device.close()
    retstr.close()
    return text

I would like a code example showing how to use this tool to get data from PDFs.

asked Jun 07 '19 12:06

Urvish

2 Answers

Full disclosure: I am one of the maintainers of pdfminer.six. It is a community-maintained version of pdfminer for Python 3.

Nowadays, it has multiple APIs to extract text from a PDF, depending on your needs. Behind the scenes, all of these APIs use the same logic for parsing and analyzing the layout.

(All the examples assume your PDF file is called example.pdf)

Command line

If you want to extract text just once, you can use the command-line tool pdf2txt.py:

$ pdf2txt.py example.pdf

High-level API

If you want to extract text (or text properties) with Python, you can use the high-level API. This approach is the go-to solution if you want to extract information from a PDF programmatically.

from pdfminer.high_level import extract_pages, extract_text

# Extract text from a PDF as a single string.
text = extract_text('example.pdf')

# Extract an iterable of LTPage objects (one per page).
pages = extract_pages('example.pdf')
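As a rough illustration of what can be done with the LTPage objects returned by extract_pages, the sketch below walks each page and collects the text of every element that carries text, together with its bounding box. The function names iter_text_boxes and text_boxes_from_pdf are hypothetical, and the check relies on duck typing (any element with a get_text() method) rather than on a specific pdfminer.six class, which is a simplifying assumption.

```python
from typing import Iterable, Iterator, Tuple


def iter_text_boxes(pages: Iterable) -> Iterator[Tuple[int, tuple, str]]:
    """Yield (page_number, bbox, text) for every text-bearing element.

    `pages` is any iterable of page layouts, such as the LTPage objects
    returned by extract_pages(). Elements without a get_text() method
    (for example lines, rectangles, or figures) are skipped.
    """
    for page_number, page_layout in enumerate(pages, start=1):
        for element in page_layout:
            if hasattr(element, "get_text"):
                yield page_number, element.bbox, element.get_text().strip()


def text_boxes_from_pdf(pdf_path: str) -> list:
    """Hypothetical convenience wrapper around extract_pages()."""
    # Imported here so iter_text_boxes() can be used without pdfminer.six.
    from pdfminer.high_level import extract_pages

    return list(iter_text_boxes(extract_pages(pdf_path)))
```

Calling text_boxes_from_pdf('example.pdf') should then give a list of tuples that can be filtered by page number or by position on the page.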

Composable API

There is also a composable API that gives a lot of flexibility in handling the resulting objects. For example, it allows you to create your own layout algorithm. This method is suggested in the other answers, but I would only recommend it when you need to customize some component.

from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser

output_string = StringIO()
with open('example.pdf', 'rb') as in_file:
    parser = PDFParser(in_file)
    doc = PDFDocument(parser)
    rsrcmgr = PDFResourceManager()
    device = TextConverter(rsrcmgr, output_string, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in PDFPage.create_pages(doc):
        interpreter.process_page(page)

print(output_string.getvalue())
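If the text is needed page by page, one option is to split the combined output afterwards. The sketch below assumes that the converter terminates each page with a form feed character, which is what TextConverter in pdfminer.six appears to do; split_pages is a hypothetical helper, not part of the library.

```python
def split_pages(text: str) -> list:
    """Split converter output into one string per page.

    Assumes every page ends with a form feed character (ASCII 12).
    """
    pages = text.split("\f")
    # A trailing form feed leaves an empty final piece; drop it.
    if pages and pages[-1] == "":
        pages.pop()
    return pages
```

With the snippet above, split_pages(output_string.getvalue()) would give a list with one entry per page, provided the assumption about the form feed holds for your version.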

Similar question and answers here. I'll try to keep them in sync.

answered Sep 19 '22 14:09

Pieter

Install pdfminer.six or pdfminer3 (https://github.com/gwk/pdfminer3/). To install pdfminer3:

$ pip install pdfminer3

I switched to pdfminer3 when I upgraded from Python 3.6 to 3.7. I use it on Ubuntu and macOS with Python 3.7.3.

pdfminer3 comes with two handy tools, pdf2txt.py and dumppdf.py. Examine their source; it is fairly small and easy to understand.

The following is a working example (once the location of the PDF file is filled in):

import io

from pdfminer3.converter import TextConverter
from pdfminer3.layout import LAParams
from pdfminer3.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer3.pdfpage import PDFPage

resource_manager = PDFResourceManager()
fake_file_handle = io.StringIO()
converter = TextConverter(resource_manager, fake_file_handle, laparams=LAParams())
page_interpreter = PDFPageInterpreter(resource_manager, converter)

with open('/path/to/file.pdf', 'rb') as fh:
    for page in PDFPage.get_pages(fh, caching=True, check_extractable=True):
        page_interpreter.process_page(page)

text = fake_file_handle.getvalue()

# Close open handles.
converter.close()
fake_file_handle.close()

print(text)

answered Sep 18 '22 14:09

LaVar

Tags:

python-3.x

pypdf2

pdfminer
