Extracting tables from a pdf

Tags: python, python-2.7, ocr, pdfminer, pdf-parsing

I'm trying to extract the data from the tables in this PDF. I've tried pdfminer and pypdf with limited success, but I can't get the data out of the tables.

This is what one of the tables looks like: [image]

As you can see, some columns are marked with an 'x'. I'm trying to turn this table into a list of objects.

This is the code so far. I'm using pdfminer now.

# pdfminer test
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfdevice import PDFDevice, TagExtractor
from pdfminer.pdfpage import PDFPage, PDFTextExtractionNotAllowed
from pdfminer.converter import XMLConverter, HTMLConverter, TextConverter, PDFPageAggregator
from pdfminer.cmapdb import CMapDB
from pdfminer.layout import LAParams, LTTextBox, LTTextLine, LTFigure, LTImage
from pdfminer.image import ImageWriter
from cStringIO import StringIO
import sys
import os


def pdfToText(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = file(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ''
    maxpages = 0
    caching = True
    pagenos = set()

    records = []
    i = 1
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,
                                  caching=caching, check_extractable=True):
        # process page
        interpreter.process_page(page)

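        # retstr accumulates text across pages, so take this page's text
        # and reset the buffer before the next page is processed
        lines = retstr.getvalue().splitlines()
        retstr.truncate(0)
        retstr.seek(0)

        # only select lines from the line containing 'Tool' to the line containing "1 The 'All'"
        idx = containsSubString(lines, 'Tool')
        lines = lines[idx+1:]
        idx = containsSubString(lines, "1 The 'All'")
        if idx != -1:
            # without this guard, a missing end marker would drop the last line
            lines = lines[:idx]

        for line in lines:
            records.append(line)
        i += 1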

    fp.close()
    device.close()
    retstr.close()

    return records


def containsSubString(lines, substring):
    # return the index of the first item containing substring, or -1 if none does
    for i, s in enumerate(lines):
        if substring in s:
            return i
    return -1


# process pdf
fn = '../test1.pdf'
ft = 'test.txt'

text = pdfToText(fn)
outFile = open(ft, 'w')
for line in text:
    # splitlines() stripped the line endings, so add them back
    outFile.write(line + '\n')
outFile.close()

That produces a text file containing all of the text, but the spacing of the x's is not preserved. The output looks like this: [image]

The x's are separated by single spaces in the text document, so I can't tell which column each one belongs to.
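The column information is lost because the text converter flattens each row into a plain string, while the horizontal position of each 'x' is what identifies its column. A minimal sketch of recovering it from coordinates follows. It is written for Python 3 and assumes pdfminer.six for `page_fragments`; the function names, the `edges` values and the sample fragments are hypothetical, and in practice the column edges would be taken from the header row of the real table.

```python
from bisect import bisect_right


def page_fragments(path):
    """Yield (y0, x0, text) for every text line in the PDF.

    Assumption: pdfminer.six is installed. It is imported lazily so the
    rest of this sketch runs without it.
    """
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer, LTTextLine
    for layout in extract_pages(path):
        for element in layout:
            if isinstance(element, LTTextContainer):
                for line in element:
                    if isinstance(line, LTTextLine):
                        yield line.y0, line.x0, line.get_text().strip()


def assign_columns(fragments, column_edges):
    """Place the (x0, text) fragments of one table row into columns.

    column_edges holds the left x coordinate of every column, sorted
    ascending. A fragment goes into the last column whose left edge is
    at or before the fragment's own x0.
    """
    row = [''] * len(column_edges)
    for x0, text in sorted(fragments):
        idx = max(bisect_right(column_edges, x0) - 1, 0)
        row[idx] = (row[idx] + ' ' + text).strip()
    return row


# Hypothetical coordinates: a tool name followed by three checkbox columns.
edges = [50.0, 200.0, 260.0, 320.0]
fragments = [(52.1, 'Hammer'), (203.4, 'x'), (322.0, 'x')]
row = assign_columns(fragments, edges)
print(row)
```

Grouping the output of `page_fragments` by similar `y0` values would give one fragment list per table row, which can then be passed to `assign_columns`.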

Right now I'm only producing text output, but my goal is to produce an HTML document with the data from the tables. I've been searching for OCR examples, and most of them seem confusing or incomplete. I'm open to using C# or any other language that might produce the results I'm looking for.
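Once the rows are available as lists of cell values, producing the HTML is the simpler part. A minimal sketch using only the standard library follows; it assumes Python 3 (for `html.escape`), and the function name, headers and rows are hypothetical placeholders rather than values from the actual PDF.

```python
from html import escape


def records_to_html(headers, rows):
    """Render header names and row values as a minimal HTML table."""
    head = ''.join('<th>{}</th>'.format(escape(h)) for h in headers)
    body = []
    for row in rows:
        # Escape every cell so characters such as '<' cannot break the markup.
        cells = ''.join('<td>{}</td>'.format(escape(str(c))) for c in row)
        body.append('<tr>{}</tr>'.format(cells))
    return '<table>\n<tr>{}</tr>\n{}\n</table>'.format(head, '\n'.join(body))


# Hypothetical headers and rows, for illustration only.
html_table = records_to_html(
    ['Tool', 'All', 'Basic'],
    [['Hammer', 'x', ''], ['Saw <large>', '', 'x']],
)
print(html_table)
```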

EDIT: There will be multiple PDFs like this that I need to get the table data from. The headers will be the same for all PDFs (as far as I know).

asked Jan 13 '15 17:01

user

2 Answers

I figured it out; I was going in the wrong direction. I created a PNG of each table in the PDF, and I'm now processing the images with OpenCV and Python.
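The answer does not say how the images are processed. One possible approach, sketched here under that caveat, is to crop each table cell and call it marked when enough of its pixels are dark. The thresholds, function names and synthetic cells are hypothetical; `load_gray` assumes OpenCV (`cv2`) is installed and is imported lazily so the rest runs without it.

```python
def dark_fraction(cell, threshold=128):
    """Return the fraction of pixels in a grayscale cell below the threshold."""
    pixels = [value for row in cell for value in row]
    if not pixels:
        return 0.0
    dark = sum(1 for value in pixels if value < threshold)
    return dark / float(len(pixels))


def is_marked(cell, min_fraction=0.05):
    """Treat a cell as marked when enough of its pixels are dark."""
    return dark_fraction(cell) >= min_fraction


def load_gray(path):
    """Load an image as rows of 0-255 grayscale values (assumes OpenCV)."""
    import cv2
    image = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    if image is None:
        raise IOError('could not read image: ' + path)
    return image.tolist()


# Synthetic 4x4 cells: one all white, one with a dark diagonal stroke.
blank = [[255] * 4 for _ in range(4)]
crossed = [[0 if r == c else 255 for c in range(4)] for r in range(4)]
print(is_marked(blank), is_marked(crossed))
```

Cell borders also count as dark pixels, so each crop should exclude the grid lines or `min_fraction` should be raised accordingly.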

answered Sep 23 '22 03:09

user

Give Tabula a try and, if it works, use the tabula-extractor library (written in Ruby) to extract the data programmatically.

answered Sep 25 '22 03:09

matagus
