 

Rule-based PDF text extraction for various bills and invoices

I have to extract text from PDF files of invoices and bills.

The file layouts can get complex, though they consist mostly of tables.

I've already read a few dozen articles about the PDF format, how easy it is for our brains to grasp it, and how hard it is for a machine to understand its structure.

I also downloaded a few tools, such as Python's pdfminer and some Java tools. Some even offer rule-based layout extraction, like LA-PDFText. These are all great libraries, but they leave the final step to you.

Adobe also has an online service called ExportPDF, but it can't be customized.

Bottom line: I understand that extracting text from structured PDF files and converting it to XML, for example, requires some level of manual work.

I also found A-PDF Form Data Extractor, a non-free tool that lets you set extraction rules and claims to do the job, though it's hard to find a proper manual and it runs only on Windows.

I thought I might even try to convert those files to images and run tesseract-ocr on them, but decided to ask for advice here before I spend more time on it.

I'd be very grateful if someone with such experience could give me a hint.

Guy Gavriely asked Apr 17 '12 10:04





2 Answers

I've done a lot of PDF extraction and I can confirm, as you've already discovered, that it can be a painful process to start. One of the important things to understand is that there is no concept of "tables" within a PDF, just text that happens to have lines around it. Also, there's no guarantee that the linear order of text within the PDF code actually matches the visual order when printed. In other words, there's no guarantee that "hello world" is written in that order; it could be: draw 'world' at coord 20, then draw 'hello' at coord 10. Most PDF creators don't do this, but there's still no guarantee. The more creative a PDF creator is (InDesign, Illustrator, etc.), the harder the text is likely to be to get out. And once a designer starts messing with fonts too much, some programs will actually output words one character at a time, changing the font just slightly each time.
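To make the ordering point concrete, here is a minimal sketch of how visual order can be rebuilt from positions alone. It assumes some extractor has already given you text fragments with the coordinates where they are drawn; the `Fragment` type, the `group_into_rows` helper, the `y_tolerance` value and the sample data are all made up for illustration and do not come from any particular library. It also assumes y grows upward, as in PDF page coordinates.

```python
from collections import namedtuple

# Hypothetical fragment type: (x, y) is where the text is drawn on the page,
# with y growing upward (assumed here, as in PDF page coordinates).
Fragment = namedtuple("Fragment", ["x", "y", "text"])


def group_into_rows(fragments, y_tolerance=2.0):
    """Group fragments into visual rows (top to bottom), each sorted left to right."""
    rows = []
    # Walk the fragments from the top of the page to the bottom.
    for frag in sorted(fragments, key=lambda f: -f.y):
        # A fragment joins the current row if its baseline is close enough
        # to the first fragment of that row; otherwise it starts a new row.
        if rows and abs(rows[-1][0].y - frag.y) <= y_tolerance:
            rows[-1].append(frag)
        else:
            rows.append([frag])
    # Within each row, restore left-to-right order by x position.
    return [[f.text for f in sorted(row, key=lambda f: f.x)] for row in rows]


# Sample data: "world" is emitted before "hello", and the two cells of the
# second row sit on slightly different baselines.
fragments = [
    Fragment(20, 700, "world"),
    Fragment(10, 700, "hello"),
    Fragment(120, 680, "9.99"),
    Fragment(10, 681, "Widget"),
]
rows = group_into_rows(fragments)
print(rows)
```

The same row grouping is a rough starting point for table-like invoices: since the PDF has no table objects, rows and columns have to be inferred from positions, and the tolerance needs tuning per layout.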

That said, I'd recommend the first one that you looked at, LA-PDFText. You can run it in discovery mode (blockify) from which you can create rules. I don't have Java installed anymore so I can't test it but it seems very promising.

Your second one, A-PDF Form Data Extractor, only really works with actual PDF forms. If that is your case, I'd recommend just using an open source solution like iText/iTextSharp.

The last one, OCR, makes me cringe. I just can't imagine that jumping through those hoops would get you a better text representation than parsing the PDF. But then again, PDF is a visual format, so maybe it would.

Personally, I use iText/iTextSharp for this kind of thing, but I also like to do things the hard way.

Chris Haas answered Sep 28 '22 07:09



It is not clear whether you are looking for a development tool to automate data extraction from bills and invoices, or just for a one-time utility that a non-developer can use.

Anyway, here are some specialized tools, along with the engines they use:

  1. Tabula (open-source, designed specifically to extract data from tables in PDFs. Can export shell scripts for batch processing, runs as a localhost web service, powered by the JRuby Tabula engine)
  2. Viet OCR (open-source .NET desktop utility for text extraction from PDFs and images, based on the Tesseract OCR engine)
  3. Bytescout PDF Viewer (freeware closed source .NET utility, detects and extracts tables, including scanned invoices, powered by PDF Extractor SDK)

DISCLAIMER: I work for ByteScout.

Eugene answered Sep 28 '22 09:09
