Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to extract text and text coordinates from a PDF file?

I want to extract all the text boxes and text box coordinates from a PDF file with PDFMiner.

Many other Stack Overflow posts address how to extract all text in an ordered fashion, but how can I do the intermediate step of getting the text and text locations?

Given a PDF file, output should look something like:

489, 41,  "Signature"
500, 52,  "b"
630, 202, "a_g_i_r"
like image 930
pnj Avatar asked Apr 06 '14 18:04

pnj


People also ask

How do I extract specific text from a PDF?

To extract information from a PDF in Acrobat DC, choose Tools > Export PDF and select an option. To extract text, export the PDF to a Word format or rich text format, and choose from several advanced options that include: Retain Flowing Text.

Can Tesseract extract text from PDF?

Using pytesseract, one can extract almost all the data irrespective of the format of the documents (whether its a scanned document or a pdf or a simple jpeg image).

How do I extract results from a PDF?

Both are available as trial versions. With the file to be processed open in Acrobat, select "Plug-Ins > Split Documents > Extract Pages By Text Search" from the main menu. Use this dialog to configure the text search. In this example, the goal is to extract any pages that contain the words “PAID” or “Total due: 0.00”.


3 Answers

Here's a copy-and-paste-ready example that lists the top-left corners of every block of text in a PDF, and which I think should work for any PDF that doesn't include "Form XObjects" that have text in them:

from pdfminer.layout import LAParams, LTTextBox from pdfminer.pdfpage import PDFPage from pdfminer.pdfinterp import PDFResourceManager from pdfminer.pdfinterp import PDFPageInterpreter from pdfminer.converter import PDFPageAggregator  fp = open('yourpdf.pdf', 'rb') rsrcmgr = PDFResourceManager() laparams = LAParams() device = PDFPageAggregator(rsrcmgr, laparams=laparams) interpreter = PDFPageInterpreter(rsrcmgr, device) pages = PDFPage.get_pages(fp)  for page in pages:     print('Processing next page...')     interpreter.process_page(page)     layout = device.get_result()     for lobj in layout:         if isinstance(lobj, LTTextBox):             x, y, text = lobj.bbox[0], lobj.bbox[3], lobj.get_text()             print('At %r is text: %s' % ((x, y), text)) 

The code above is based upon the Performing Layout Analysis example in the PDFMiner docs, plus the examples by pnj (https://stackoverflow.com/a/22898159/1709587) and Matt Swain (https://stackoverflow.com/a/25262470/1709587). There are a couple of changes I've made from these previous examples:

  • I use PDFPage.get_pages(), which is a shorthand for creating a document, checking it is_extractable, and passing it to PDFPage.create_pages()
  • I don't bother handling LTFigures, since PDFMiner is currently incapable of cleanly handling text inside them anyway.

LAParams lets you set some parameters that control how individual characters in the PDF get magically grouped into lines and textboxes by PDFMiner. If you're surprised that such grouping is a thing that needs to happen at all, it's justified in the pdf2txt docs:

In an actual PDF file, text portions might be split into several chunks in the middle of its running, depending on the authoring software. Therefore, text extraction needs to splice text chunks.

LAParams's parameters are, like most of PDFMiner, undocumented, but you can see them in the source code or by calling help(LAParams) at your Python shell. The meaning of some of the parameters is given at https://pdfminer-docs.readthedocs.io/pdfminer_index.html#pdf2txt-py since they can also be passed as arguments to pdf2text at the command line.

The layout object above is an LTPage, which is an iterable of "layout objects". Each of these layout objects can be one of the following types...

  • LTTextBox
  • LTFigure
  • LTImage
  • LTLine
  • LTRect

... or their subclasses. (In particular, your textboxes will probably all be LTTextBoxHorizontals.)

More detail of the structure of an LTPage is shown by this image from the docs:

Tree diagram of the structure of an <code>LTPage</code>. Of relevance to this answer: it shows that an <code>LTPage</code> contains the 5 types listed above, and that an <code>LTTextBox</code> contains <code>LTTextLine</code>s plus unspecified other stuff, and that an <code>LTTextLine</code> contains <code>LTChar</code>s, <code>LTAnno</code>s, <code>LTText</code>s, and unspecified other stuff.

Each of the types above has a .bbox property that holds a (x0, y0, x1, y1) tuple containing the coordinates of the left, bottom, right, and top of the object respectively. The y-coordinates are given as the distance from the bottom of the page. If it's more convenient for you to work with the y-axis going from top to bottom instead, you can subtract them from the height of the page's .mediabox:

x0, y0_orig, x1, y1_orig = some_lobj.bbox y0 = page.mediabox[3] - y1_orig y1 = page.mediabox[3] - y0_orig 

In addition to a bbox, LTTextBoxes also have a .get_text() method, shown above, that returns their text content as a string. Note that each LTTextBox is a collection of LTChars (characters explicitly drawn by the PDF, with a bbox) and LTAnnos (extra spaces that PDFMiner adds to the string representation of the text box's content based upon the characters being drawn a long way apart; these have no bbox).

The code example at the beginning of this answer combined these two properties to show the coordinates of each block of text.

Finally, it's worth noting that, unlike the other Stack Overflow answers cited above, I don't bother recursing into LTFigures. Although LTFigures can contain text, PDFMiner doesn't seem capable of grouping that text into LTTextBoxes (you can try yourself on the example PDF from https://stackoverflow.com/a/27104504/1709587) and instead produces an LTFigure that directly contains LTChar objects. You could, in principle, figure out how to piece these together into a string, but PDFMiner (as of version 20181108) can't do it for you.

Hopefully, though, the PDFs you need to parse don't use Form XObjects with text in them, and so this caveat won't apply to you.

like image 135
Mark Amery Avatar answered Sep 24 '22 06:09

Mark Amery


Newlines are converted to underscores in final output. This is the minimal working solution that I found.

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfpage import PDFTextExtractionNotAllowed
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfdevice import PDFDevice
from pdfminer.layout import LAParams
from pdfminer.converter import PDFPageAggregator
import pdfminer

# Open a PDF file.
fp = open('/Users/me/Downloads/test.pdf', 'rb')

# Create a PDF parser object associated with the file object.
parser = PDFParser(fp)

# Create a PDF document object that stores the document structure.
# Password for initialization as 2nd parameter
document = PDFDocument(parser)

# Check if the document allows text extraction. If not, abort.
if not document.is_extractable:
    raise PDFTextExtractionNotAllowed

# Create a PDF resource manager object that stores shared resources.
rsrcmgr = PDFResourceManager()

# Create a PDF device object.
device = PDFDevice(rsrcmgr)

# BEGIN LAYOUT ANALYSIS
# Set parameters for analysis.
laparams = LAParams()

# Create a PDF page aggregator object.
device = PDFPageAggregator(rsrcmgr, laparams=laparams)

# Create a PDF interpreter object.
interpreter = PDFPageInterpreter(rsrcmgr, device)

def parse_obj(lt_objs):

    # loop over the object list
    for obj in lt_objs:

        # if it's a textbox, print text and location
        if isinstance(obj, pdfminer.layout.LTTextBoxHorizontal):
            print "%6d, %6d, %s" % (obj.bbox[0], obj.bbox[1], obj.get_text().replace('\n', '_'))

        # if it's a container, recurse
        elif isinstance(obj, pdfminer.layout.LTFigure):
            parse_obj(obj._objs)

# loop over all pages in the document
for page in PDFPage.create_pages(document):

    # read the page into a layout object
    interpreter.process_page(page)
    layout = device.get_result()

    # extract text from this object
    parse_obj(layout._objs)
like image 30
pnj Avatar answered Sep 22 '22 06:09

pnj


Full disclosure, I am one of the maintainers of pdfminer.six. It is a community-maintained version of pdfminer for python 3.

Nowadays, pdfminer.six has multiple API's to extract text and information from a PDF. For programmatically extracting information I would advice to use extract_pages(). This allows you to inspect all of the elements on a page, ordered in a meaningful hierarchy created by the layout algorithm.

The following example is a pythonic way of showing all the elements in the hierachy. It uses the simple1.pdf from the samples directory of pdfminer.six.

from pathlib import Path
from typing import Iterable, Any

from pdfminer.high_level import extract_pages


def show_ltitem_hierarchy(o: Any, depth=0):
    """Show location and text of LTItem and all its descendants"""
    if depth == 0:
        print('element                        x1  y1  x2  y2   text')
        print('------------------------------ --- --- --- ---- -----')

    print(
        f'{get_indented_name(o, depth):<30.30s} '
        f'{get_optional_bbox(o)} '
        f'{get_optional_text(o)}'
    )

    if isinstance(o, Iterable):
        for i in o:
            show_ltitem_hierarchy(i, depth=depth + 1)


def get_indented_name(o: Any, depth: int) -> str:
    """Indented name of LTItem"""
    return '  ' * depth + o.__class__.__name__


def get_optional_bbox(o: Any) -> str:
    """Bounding box of LTItem if available, otherwise empty string"""
    if hasattr(o, 'bbox'):
        return ''.join(f'{i:<4.0f}' for i in o.bbox)
    return ''


def get_optional_text(o: Any) -> str:
    """Text of LTItem if available, otherwise empty string"""
    if hasattr(o, 'get_text'):
        return o.get_text().strip()
    return ''


path = Path('~/Downloads/simple1.pdf').expanduser()

pages = extract_pages(path)
show_ltitem_hierarchy(pages)

The output shows the different elements in the hierarchy. The bounding box for each. And the text that this element contains.

element                        x1  y1  x2  y2   text
------------------------------ --- --- --- ---- -----
generator                       
  LTPage                       0   0   612 792  
    LTTextBoxHorizontal        100 695 161 719  Hello
      LTTextLineHorizontal     100 695 161 719  Hello
        LTChar                 100 695 117 719  H
        LTChar                 117 695 131 719  e
        LTChar                 131 695 136 719  l
        LTChar                 136 695 141 719  l
        LTChar                 141 695 155 719  o
        LTChar                 155 695 161 719  
        LTAnno                  
    LTTextBoxHorizontal        261 695 324 719  World
      LTTextLineHorizontal     261 695 324 719  World
        LTChar                 261 695 284 719  W
        LTChar                 284 695 297 719  o
        LTChar                 297 695 305 719  r
        LTChar                 305 695 311 719  l
        LTChar                 311 695 324 719  d
        LTAnno                  
    LTTextBoxHorizontal        100 595 161 619  Hello
      LTTextLineHorizontal     100 595 161 619  Hello
        LTChar                 100 595 117 619  H
        LTChar                 117 595 131 619  e
        LTChar                 131 595 136 619  l
        LTChar                 136 595 141 619  l
        LTChar                 141 595 155 619  o
        LTChar                 155 595 161 619  
        LTAnno                  
    LTTextBoxHorizontal        261 595 324 619  World
      LTTextLineHorizontal     261 595 324 619  World
        LTChar                 261 595 284 619  W
        LTChar                 284 595 297 619  o
        LTChar                 297 595 305 619  r
        LTChar                 305 595 311 619  l
        LTChar                 311 595 324 619  d
        LTAnno                  
    LTTextBoxHorizontal        100 495 211 519  H e l l o
      LTTextLineHorizontal     100 495 211 519  H e l l o
        LTChar                 100 495 117 519  H
        LTAnno                  
        LTChar                 127 495 141 519  e
        LTAnno                  
        LTChar                 151 495 156 519  l
        LTAnno                  
        LTChar                 166 495 171 519  l
        LTAnno                  
        LTChar                 181 495 195 519  o
        LTAnno                  
        LTChar                 205 495 211 519  
        LTAnno                  
    LTTextBoxHorizontal        321 495 424 519  W o r l d
      LTTextLineHorizontal     321 495 424 519  W o r l d
        LTChar                 321 495 344 519  W
        LTAnno                  
        LTChar                 354 495 367 519  o
        LTAnno                  
        LTChar                 377 495 385 519  r
        LTAnno                  
        LTChar                 395 495 401 519  l
        LTAnno                  
        LTChar                 411 495 424 519  d
        LTAnno                  
    LTTextBoxHorizontal        100 395 211 419  H e l l o
      LTTextLineHorizontal     100 395 211 419  H e l l o
        LTChar                 100 395 117 419  H
        LTAnno                  
        LTChar                 127 395 141 419  e
        LTAnno                  
        LTChar                 151 395 156 419  l
        LTAnno                  
        LTChar                 166 395 171 419  l
        LTAnno                  
        LTChar                 181 395 195 419  o
        LTAnno                  
        LTChar                 205 395 211 419  
        LTAnno                  
    LTTextBoxHorizontal        321 395 424 419  W o r l d
      LTTextLineHorizontal     321 395 424 419  W o r l d
        LTChar                 321 395 344 419  W
        LTAnno                  
        LTChar                 354 395 367 419  o
        LTAnno                  
        LTChar                 377 395 385 419  r
        LTAnno                  
        LTChar                 395 395 401 419  l
        LTAnno                  
        LTChar                 410 395 424 419  d
        LTAnno                  

(Similar answer here, here and here , I'll try to keep them in sync.)

like image 45
Pieter Avatar answered Sep 20 '22 06:09

Pieter