I have to extract text from PDF invoices and bills. The layouts can get complex, though they consist mostly of tables. I've already read a few dozen articles about the PDF format: how easy it is for our brains to grasp a page and how hard it is for a machine to understand its structure.

I've also downloaded a few tools, such as Python's pdfminer and some Java tools. Some even offer rule-based layout extraction, like LA-PDFText. These are all great libraries, but they leave the final step to you. Adobe also has an online service called ExportPDF, but it can't be customized.

Bottom line: I understand that extracting text from structured PDF files and converting it to, for example, XML requires some level of manual work.

I also found A-PDF Form Data Extractor, a non-free tool that lets you set extraction rules and claims to do the job, though it's hard to find a proper manual and it runs only on Windows.

I even thought about converting the files to images and trying tesseract-ocr, but decided to ask for advice here before spending more time on it. I'd be very grateful if someone with such experience could give me a hint.


Rule-based PDF text extraction for various bills and invoices

2 Answers

I've done a lot of PDF extraction and I can confirm, as you've already discovered, that it can be a painful process to start. One of the important things to understand is that there is no concept of "tables" within a PDF, just text that happens to have lines around it. Also, there's no guarantee that the linear order of text within the PDF code matches the visual order when printed. In other words, there's no guarantee that "hello world" is written in that order; it could be draw 'world' at coord 20, then draw 'hello' at coord 10. Most PDF creators don't do this, but still, there's no guarantee. The more creative a PDF creator is (InDesign, Illustrator, etc.), the harder the text is likely to be to get out. And once a designer starts messing with fonts too much, some programs will actually output words one character at a time, changing the font just slightly each time.
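To make the ordering problem concrete, here is a minimal sketch of rebuilding visual order from positioned text. It assumes your PDF parser already gives you each piece of text with an x and y coordinate; the `Fragment` class, the `group_into_lines` helper, the tolerance value, and the sample data are all made up for illustration and don't come from any particular library.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """One positioned piece of text, as a PDF parser might report it."""
    text: str
    x: float  # left edge of the fragment
    y: float  # baseline; assumed here to grow down the page


def group_into_lines(fragments, y_tolerance=2.0):
    """Cluster fragments with nearly equal baselines, then sort each
    cluster left to right and join it into one line of text."""
    lines = []  # list of (baseline_y, [fragments on that line])
    for frag in sorted(fragments, key=lambda f: f.y):
        if lines and abs(frag.y - lines[-1][0]) <= y_tolerance:
            lines[-1][1].append(frag)
        else:
            lines.append((frag.y, [frag]))
    return [
        " ".join(f.text for f in sorted(frags, key=lambda f: f.x))
        for _, frags in lines
    ]


# Hypothetical fragments, deliberately stored out of visual order.
fragments = [
    Fragment("world", x=20, y=100),
    Fragment("hello", x=10, y=100),
    Fragment("42.00", x=120, y=130.5),
    Fragment("Total", x=10, y=130),
]
lines = group_into_lines(fragments)
print(lines)
```

The same idea extends to tables: once fragments are grouped into lines, the x positions that repeat from line to line are a reasonable guess at column boundaries. Real documents usually need a tolerance tuned to the font size.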

That said, I'd recommend the first one that you looked at, LA-PDFText. You can run it in discovery mode (blockify) from which you can create rules. I don't have Java installed anymore so I can't test it but it seems very promising.
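Independently of any particular tool, a "rule" can be as simple as a named pattern applied to the text you've already extracted. The sketch below is not LA-PDFText's rule format; the field names, the regular expressions, and the sample lines are hypothetical and would need adjusting to your actual invoices.

```python
import re

# Hypothetical rules: field name -> pattern with exactly one capturing group.
RULES = {
    "invoice_number": re.compile(r"Invoice\s*(?:No\.?|#)\s*:?\s*(\S+)", re.IGNORECASE),
    "date": re.compile(r"Date\s*:?\s*(\d{4}-\d{2}-\d{2})", re.IGNORECASE),
    "total": re.compile(r"Total\s*:?\s*\$?\s*([\d,]+\.\d{2})", re.IGNORECASE),
}


def apply_rules(lines, rules=RULES):
    """Return the first match of each rule across the given text lines."""
    result = {}
    for line in lines:
        for field, pattern in rules.items():
            if field in result:
                continue  # keep only the first hit per field
            match = pattern.search(line)
            if match:
                result[field] = match.group(1)
    return result


# Made-up lines standing in for text extracted from one invoice.
sample_lines = [
    "ACME Corp",
    "Invoice No: INV-0042   Date: 2013-05-14",
    "Widget      2   10.00   20.00",
    "Total: $20.00",
]
fields = apply_rules(sample_lines)
print(fields)
```

Keeping the rules in a plain dictionary means each invoice layout can get its own rule set without touching the extraction code.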

Your second one, A-PDF Form Data Extractor, only really works with actual PDF forms. If this is your case I'd recommend just using an open source solution like iText/iTextSharp.

The last one, OCR, makes me cringe. I just can't imagine that jumping through those hoops would get you a better text representation than parsing the PDF directly. But then again, PDF is a visual format, so maybe it would.

Personally I use iText/iTextSharp for this kind of thing but I also like to do things the hard way.

200

answered Sep 28 '22 07:09

Chris Haas

It is not clear whether you are looking for a development tool to automate data extraction from bills and invoices, or just for a one-time utility that a non-developer can use.

Anyway, here are some specialized tools, along with the engines they use:

1. Tabula (open-source, designed specifically to extract data from tables in PDFs; can export shell scripts for batch processing, runs as a localhost web service, powered by the JRuby Tabula engine)
2. VietOCR (open-source .NET desktop utility for text extraction from PDFs and images, based on the Tesseract OCR engine)
3. Bytescout PDF Viewer (freeware, closed-source .NET utility; detects and extracts tables, including from scanned invoices; powered by PDF Extractor SDK)

DISCLAIMER: I work for ByteScout.

answered Sep 28 '22 09:09

Eugene


Tags:

pdf

text-extraction

Guy Gavriely
