Extract Text Using PdfMiner and PyPDF2 Merges columns

I am trying to parse the text of a PDF file using PDFMiner, but the columns in the extracted text get merged. I am using the PDF file from the following link.

PDF File

I am fine with any type of output (file or string). Here is the code, which returns the extracted text as a string, but for some reason the columns are merged.

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFResourceManager, process_pdf
from StringIO import StringIO  # Python 2 module; import the class, not just the module

def convert_pdf(filename):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()  # in-memory buffer that receives the extracted text
    codec = 'utf-8'
    laparams = LAParams()  # created here, but not passed to TextConverter below
    device = TextConverter(rsrcmgr, retstr, codec=codec)

    fp = open(filename, 'rb')
    process_pdf(rsrcmgr, device, fp)
    fp.close()
    device.close()

    text = retstr.getvalue()  # avoid shadowing the built-in str
    retstr.close()
    return text

I have also tried PyPDF2, but faced the same issue. Here is the sample code for PyPDF2:

from PyPDF2.pdf import PdfFileReader

def getDataUsingPyPdf2(filename):
    pdf = PdfFileReader(open(filename, "rb"))
    content = ""

    for i in range(0, pdf.getNumPages()):
        print(i)  # progress: index of the page being processed
        extractedText = pdf.getPage(i).extractText()
        content += extractedText + "\n"

    # collapse all runs of whitespace (including non-breaking spaces) into single spaces
    content = " ".join(content.replace("\xa0", " ").strip().split())
    return content.encode("ascii", "ignore")

I have also tried pdf2txt.py, but was unable to get formatted output.

asked Apr 01 '13 04:04 by user2151334



2 Answers

I recently struggled with a similar problem, although my PDF had a slightly simpler structure.

PDFMiner uses classes called "devices" to parse the pages in a PDF file. The basic device class is the PDFPageAggregator class, which simply parses the text boxes in the file. The converter classes, e.g. TextConverter, XMLConverter, and HTMLConverter, also output the result to a file (or to a string stream, as in your example) and do some more elaborate parsing of the contents.

The problem with TextConverter (and PDFPageAggregator) is that they don't recurse deep enough into the structure of the document to properly extract the different columns. The two other converters require some information about the structure of the document for display purposes, so they gather more detailed data. In your example PDF, both of the simplistic devices only parse (roughly) the entire text box containing the columns, which makes it impossible (or at least very difficult) to correctly separate the different rows. The solution that I found works pretty well is to either

  • Create a new class that inherits from PDFPageAggregator, or
  • Use XMLConverter and parse the resulting XML document using e.g. BeautifulSoup

In both cases you would have to combine the different text segments to rows using their bounding box y-coordinates.
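For the XML route, here is a minimal sketch of pulling text lines and their bounding boxes out of the converter's output. It uses the standard library's xml.etree.ElementTree instead of BeautifulSoup, purely to avoid an extra dependency. The sample XML is hand-written and only imitates what I understand XMLConverter to produce (`page`, `textline`, and `text` elements with `bbox` attributes), so check the element and attribute names against your real output. The function name `textlines_from_xml` is my own.

```python
import xml.etree.ElementTree as ET

# Hand-written sample imitating XMLConverter output (assumed shape, not real data).
SAMPLE_XML = """<pages>
<page id="1" bbox="0.000,0.000,612.000,792.000" rotate="0">
<textbox id="0" bbox="50.000,700.000,142.000,710.000">
<textline bbox="50.000,700.000,62.000,710.000">
<text font="F1" bbox="50.000,700.000,56.000,710.000" size="10.000">3</text>
<text font="F1" bbox="56.000,700.000,62.000,710.000" size="10.000">7</text>
<text>
</text>
</textline>
<textline bbox="130.000,700.000,142.000,710.000">
<text font="F1" bbox="130.000,700.000,136.000,710.000" size="10.000">R</text>
<text font="F1" bbox="136.000,700.000,142.000,710.000" size="10.000">D</text>
<text>
</text>
</textline>
</textbox>
</page>
</pages>"""


def textlines_from_xml(xml_string):
    """Return a (page_id, x0, y0, x1, y1, text) tuple for every <textline>."""
    root = ET.fromstring(xml_string)
    lines = []
    for page in root.iter("page"):
        for line in page.iter("textline"):
            # bbox is a comma-separated "x0,y0,x1,y1" string
            x0, y0, x1, y1 = (float(v) for v in line.get("bbox").split(","))
            # each <text> element holds one character (or whitespace)
            text = "".join(t.text or "" for t in line.iter("text"))
            text = " ".join(text.split())  # normalise whitespace
            if text:
                lines.append((page.get("id"), x0, y0, x1, y1, text))
    return lines


lines = textlines_from_xml(SAMPLE_XML)
```

Here the two text lines share the same y-coordinates but sit at different x-positions, which is the information needed to treat them as two cells of one row.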

In the case of a new device class ('tis more elegant, I think) you would have to override the method receive_layout, which gets called for each page during the rendering process. This method then recursively parses the elements in each page. For example, something like this might get you started:

from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LTPage, LTChar, LTAnno, LTTextBox, LTTextLine

class PDFPageDetailedAggregator(PDFPageAggregator):
    """Collects every text line together with its page number and bounding box."""

    def __init__(self, rsrcmgr, pageno=1, laparams=None):
        PDFPageAggregator.__init__(self, rsrcmgr, pageno=pageno, laparams=laparams)
        self.rows = []
        self.page_number = 0

    def receive_layout(self, ltpage):
        def render(item, page_number):
            if isinstance(item, (LTPage, LTTextBox)):
                # containers: descend into their children
                for child in item:
                    render(child, page_number)
            elif isinstance(item, LTTextLine):
                # a text line: join its characters and normalise whitespace
                child_str = ''
                for child in item:
                    if isinstance(child, (LTChar, LTAnno)):
                        child_str += child.get_text()
                child_str = ' '.join(child_str.split())
                if child_str:
                    # bbox == (x0, y0, x1, y1)
                    row = (page_number, item.bbox[0], item.bbox[1], item.bbox[2], item.bbox[3], child_str)
                    self.rows.append(row)

        render(ltpage, self.page_number)
        self.page_number += 1
        # order by page, then top to bottom (larger y is higher on the page)
        self.rows = sorted(self.rows, key=lambda x: (x[0], -x[2]))
        self.result = ltpage

In the code above, each found LTTextLine element is stored in an ordered list of tuples containing the page number, coordinates of the bounding box, and the text contained in that particular element. You would then do something similar to this:

from pprint import pprint
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.layout import LAParams

fp = open('pdf_doc.pdf', 'rb')
parser = PDFParser(fp)
# pdfminer releases that ship pdfminer.pdfpage take the password as a
# constructor argument (leave empty for no password)
doc = PDFDocument(parser, password='')

rsrcmgr = PDFResourceManager()
laparams = LAParams()
device = PDFPageDetailedAggregator(rsrcmgr, laparams=laparams)
interpreter = PDFPageInterpreter(rsrcmgr, device)

for page in PDFPage.create_pages(doc):
    interpreter.process_page(page)
    # receive the LTPage object for this page
    device.get_result()

fp.close()
pprint(device.rows)

The variable device.rows contains the ordered list with all the text lines arranged using their page number and y-coordinates. You can loop over the text lines and group lines with the same y-coordinates to form the rows, store the column data etc.
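A rough sketch of that grouping step, assuming the tuples have the shape built above, `(page, x0, y0, x1, y1, text)`. The helper name `group_into_rows`, the `y_tolerance` value, and the sample data are all made up for illustration. A tolerance is used because lines that look aligned on the page rarely have exactly equal y-coordinates.

```python
def group_into_rows(lines, y_tolerance=2.0):
    """Group (page, x0, y0, x1, y1, text) tuples into table rows.

    A line joins the current row when it is on the same page and its
    bottom edge (y0) is within y_tolerance points of the row's first line.
    """
    # Same ordering as device.rows: by page, then top to bottom.
    ordered = sorted(lines, key=lambda r: (r[0], -r[2]))
    rows = []
    for line in ordered:
        last = rows[-1] if rows else None
        if last and last[0][0] == line[0] and abs(last[0][2] - line[2]) <= y_tolerance:
            last.append(line)
        else:
            rows.append([line])
    # Within each row, order the cells left to right and keep only the text.
    return [[cell[5] for cell in sorted(row, key=lambda r: r[1])] for row in rows]


# Made-up sample in the shape of device.rows
sample = [
    (0, 300.0, 700.2, 380.0, 710.2, "MULTIPLE DWELLING A"),
    (0, 50.0, 700.0, 120.0, 710.0, "370 TO 372"),
    (0, 130.0, 699.8, 200.0, 709.8, "WILLOW RD W"),
    (0, 50.0, 680.0, 120.0, 690.0, "380 TO 384"),
    (1, 50.0, 700.0, 120.0, 710.0, "1000 TO 1020"),
]
table = group_into_rows(sample)
```

Each inner list of `table` is one row with its cells ordered by x-coordinate, so it can be written out directly, for example with the csv module.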

I tried to parse your PDF using the above code and the columns are mostly parsed correctly. However, some of the columns are so close together that the default PDFMiner heuristics fail to separate them into their own elements. You can probably get around this by tweaking the word margin parameter (the -W flag in the command line tool pdf2txt.py). In any case, you might want to read through the (poorly documented) PDFMiner API as well as browse through the source code of PDFMiner, which you can obtain from GitHub. (Alas, I cannot paste the link because I do not have sufficient rep points :'<, but you can hopefully google the correct repo)
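In code, the same margins can be set on the LAParams object that is handed to the device. The sketch below only shows where the knobs live. The values are hypothetical starting points rather than settings verified against this PDF, and `build_laparams` is a helper name of my own. Besides `word_margin` I also lower `char_margin`, which as I understand it controls how far apart two characters may be while still being placed on the same text line. That is my assumption about what helps here, not something tested above.

```python
# Hypothetical starting values; tune them against your own PDF.
TIGHT_COLUMN_PARAMS = {
    "char_margin": 1.0,   # library default is 2.0
    "word_margin": 0.05,  # library default is 0.1
}


def build_laparams(overrides=None):
    """Return an LAParams object with the given keyword overrides applied."""
    # Imported here so this snippet can be loaded even where pdfminer is absent.
    from pdfminer.layout import LAParams
    return LAParams(**(overrides or TIGHT_COLUMN_PARAMS))
```

It would then be used as `device = PDFPageDetailedAggregator(rsrcmgr, laparams=build_laparams())` in place of the default `LAParams()`.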

answered Oct 05 '22 19:10 by lindblandro


I tried your first block of code and got a bunch of results that look like this:

MULTIPLE DWELLING AGARDEN COMPLEX 14945010314370 TO 372WILLOWRD W MULTIPLE DWELLING AGARDEN COMPLEX 14945010314380 TO 384WILLOWRD W MULTIPLE DWELLING AGARDEN COMPLEX 149450103141000 TO 1020WILLOWBROOKRD MULTIPLE DWELLING AROOMING HOUSE 198787

I am guessing you are in a similar position as in this answer, and that all the whitespace is produced by positioning the words in the proper place, not by actual printable space characters. The fact that you have tried other PDF libraries makes me think that this might be an issue that is difficult for any PDF library to parse.
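To illustrate the idea: when no space glyphs exist, spaces have to be inferred from the gaps between character positions. The following is a simplified sketch of that inference, not PDFMiner's actual algorithm. The function name `join_chars`, the `gap_ratio` threshold, and the character positions are all invented for the example.

```python
def join_chars(chars, gap_ratio=0.3):
    """Rebuild a line of text from positioned characters.

    chars: iterable of (x0, x1, character) tuples belonging to one line.
    A space is inserted whenever the horizontal gap to the previous
    character exceeds gap_ratio times that previous character's width.
    """
    out = []
    prev = None
    for x0, x1, ch in sorted(chars):
        if prev is not None:
            prev_x0, prev_x1 = prev
            if x0 - prev_x1 > gap_ratio * (prev_x1 - prev_x0):
                out.append(" ")
        out.append(ch)
        prev = (x0, x1)
    return "".join(out)


# Made-up positions: "TO" and "372" are drawn 4 points apart, with no space glyph.
sample = [(0, 6, "T"), (6, 12, "O"), (16, 22, "3"), (22, 28, "7"), (28, 34, "2")]
spaced = join_chars(sample)                 # threshold 1.8 pt: the gap counts as a space
merged = join_chars(sample, gap_ratio=1.0)  # threshold 6 pt: the gap is ignored
```

With a threshold that is too large for the document, neighbouring words run together, which is the kind of output shown above.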

answered Oct 05 '22 18:10 by Stedy