Extracting data from Invoices in pdf or image format

Tags:

I am working on invoice parser which extracts data from invoices in pdf or image format.It works on simple pdf with non tabular data but gives lots of output data to process with pdf which contains tables.I am not able to get a working generic solution for this.I have tried the following libraries

Invoice2Data : It is based on templates.It has given fairly good results in json format till now.But Template creation for complex pdfs containing dynamic table is complex.

Tabula : Table extraction is based on coordinates of the table to be extracted.If the data in the table increases the table length increases and hence the coordinates changes.So in this case it gives wrong results.

Pdftotext : It converts any pdfs to text but with the format that needs lots of parsing which we do not want.

Aws_Textract and Elis_Rossum_Ai : Gives all the data in json format.But if the table column contains multiple line then json parsing becomes difficult.Even the json given is huge in size to parse.

Tesseract : Same as pdftotext.Complex pdfs are not parseable.

Other than all this or with combination of the above libraries has anyone been able to parse complex pdf data please help.

476

asked May 23 '19 15:05

Rajesh Gosemath

Video Answer

1 Answers

I am working on a similar business problem. since invoices don't have fixed format so you can't directly use any text parsing method.

To solve this problem you have to use Computer Vision (Deep Learning) for field detection and Pytesseract OCR for converting image into text. For better understanding here are the steps:

Convert invoices to image and annotate the images with fields like address, Amount etc using tools like labelImg. (For better results use different types of 500-1000 invoices)
After Generating XML files train any object detection model like YOLO or TF object detection API.
The model will detect the fields and gives you coordinates of Region Of Interest(ROI). like
Apply Pytessract OCR on the ROI coordinates. Click Here
Finally, use regex to validate the text in the extracted field and perform any manipulation/transformation that is necessary. At last store data to CSV OR Database.

Hope my answer helps you! Upvote answer so it reaches to maximum people.

188

answered Oct 05 '22 23:10

Yashraj Nigam

Related questions
                            
                                Does a .NET CQL Parser Exist?
                            
                                Generating JavaScript from a JSLINT parse tree
                            
                                Is the order of reduction defined in Yacc?
                            
                                Using R to interpret a symbolic formula for outside use
                            
                                How do I parse this type of expressions?
                            
                                How is maximal-munch implemented?
                            
                                Which is the best way to parse JSON data in Android?
                            
                                Marpa: Can I explicitly disallow keywords as identifiers?
                            
                                Pandas read_fwf not Loading Entire Content of File
                            
                                How to prove left-recursive grammar is not in LL(1) using parsing table
                            
                                Are parsing expression grammars suited to parsing the shell command language?
                            
                                Parsing zero-based month strings in java
                            
                                How to write a parser for a mutually recursive ADT without recursion and side effects?
                            
                                How can I grab data series from xml or tcx file
                            
                                Parsing C# Conditional Compilation statements in roslyn
                            
                                Happy Context-Dependent Operator Precedence
                            
                                Insert a large csv file, 200'000 rows+, into MongoDB in NodeJS
                            
                                In Python, how to parse a string representing a set of keyword arguments such that the order does not matter
                            
                                Is every LL(1) grammar also a LALR(1) grammar?
                            
                                pandas.errors.ParserError: Error could possibly be due to quotes being ignored when a multi-char delimiter is used

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Extracting data from Invoices in pdf or image format

Tags:

parsing

ocr

pdftotext

invoice

tabula

Rajesh Gosemath

People also ask

Video Answer

1 Answers

Yashraj Nigam

Recent Activity

Donate For Us