Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Extracting tables from a pdf

I'm trying to get the data from the tables in this PDF. I've tried pdfminer and pypdf with a little luck but I can't really get the data from the tables.

This is what one of the tables looks like: enter image description here

As you can see, some columns are marked with an 'x'. I'm trying to this table into a list of objects.

This is the code so far, I'm using pdfminer now.

# pdfminer test
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfdevice import PDFDevice, TagExtractor
from pdfminer.pdfpage import PDFPage, PDFTextExtractionNotAllowed
from pdfminer.converter import XMLConverter, HTMLConverter, TextConverter, PDFPageAggregator
from pdfminer.cmapdb import CMapDB
from pdfminer.layout import LAParams, LTTextBox, LTTextLine, LTFigure, LTImage
from pdfminer.image import ImageWriter
from cStringIO import StringIO
import sys
import os


def pdfToText(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = file(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ''
    maxpages = 0
    caching = True
    pagenos = set()

    records = []
    i = 1
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,
                                  caching=caching, check_extractable=True):
        # process page
        interpreter.process_page(page)

        # only select lines from the line containing 'Tool' to the line containing "1 The 'All'"
        lines = retstr.getvalue().splitlines()

        idx = containsSubString(lines, 'Tool')
        lines = lines[idx+1:]
        idx = containsSubString(lines, "1 The 'All'")
        lines = lines[:idx]

        for line in lines:
            records.append(line)
        i += 1

    fp.close()
    device.close()
    retstr.close()

    return records


def containsSubString(list, substring):
    # find a substring in a list item
    for i, s in enumerate(list):
        if substring in s:
            return i
    return -1


# process pdf
fn = '../test1.pdf'
ft = 'test.txt'

text = pdfToText(fn)
outFile = open(ft, 'w')
for i in range(0, len(text)):
    outFile.write(text[i])
outFile.close()

That produces a text file and it gets all of the text but, the x's don't have the spacing preserved. The output looks like this: enter image description here

The x's are just single spaced in the text document

Right now, I'm just producing text output but my goal is to produce an html document with the data from the tables. I've been searching for OCR examples, and most of them seem confusing or incomplete. I'm open to using C# or any other language that might produce the results I'm looking for.

EDIT: There will be multiple pdfs like this that I need to get the table data from. The headers will be the same for all pdfs (s far as I know).

like image 454
user Avatar asked Jan 13 '15 17:01

user


People also ask

How to extract data from a PDF file?

Step 1: Open the PDF file In Adobe Acrobat Pro DC > File > Open Step 2: Locate the table from which you want to extract data and drag a selection over the table as shown below Step 3: Right-click and select “Export Selection As…”

How to extract data from a table in Adobe Acrobat Pro DC?

In Adobe Acrobat Pro DC > File > Open Step 2: Locate the table from which you want to extract data and drag a selection over the table as shown below Step 3: Right-click and select “Export Selection As…” Step 4: Choose the export type

How to extract tables from PDF online using hipdf?

The following steps illustrate how to extract tables from PDF online using HiPDF. Step 1. Visit HiPDF using your browser and select the “PDF to Excel” option under the “Convert from PDF” section. Step 2. Click on the “CHOOSE FILE” button and browse the target PDF file. Select the PDF and click “Open” to upload it.

How to import and export table data from PDF to excel?

Step 1: Launch the Adobe Acrobat Pro DC app. Step 2: On the app, select Open a File. Step 3: Using the file library select the PDF file to be imported to the app. Step 4: Once the file opens, use the cursor to highlight the table data. Step 5: Right-click on the highlighted area and select Export Selection As.


2 Answers

I figured it out, I was going in the wrong direction. What I did was create pngs of each table in the pdf and now I'm processing the images using opencv & python.

like image 51
user Avatar answered Sep 23 '22 03:09

user


Give a try to Tabula and if it works use tabula-extractor library (written in ruby) to programatically extract the data.

like image 21
matagus Avatar answered Sep 25 '22 03:09

matagus