
Extracting data from Invoices in pdf or image format

I am working on an invoice parser that extracts data from invoices in PDF or image format. It works on simple PDFs with non-tabular data, but for PDFs that contain tables it produces a lot of output that needs further processing. I have not been able to find a working generic solution. I have tried the following libraries:

Invoice2Data: It is template-based. So far it has given fairly good results in JSON format, but creating templates for complex PDFs that contain dynamic tables is difficult.

Tabula: Table extraction is based on the coordinates of the table to be extracted. If the amount of data in the table grows, the table gets longer and its coordinates change, so in that case it gives wrong results.

Pdftotext: It converts any PDF to text, but in a format that needs a lot of parsing, which we want to avoid.

Aws_Textract and Elis_Rossum_Ai: Both return all the data in JSON format. But if a table column contains multi-line values, parsing the JSON becomes difficult. The JSON returned is also very large to parse.

Tesseract: Same as pdftotext. Complex PDFs are not parseable.

Has anyone been able to parse complex PDF data, either with something other than the above or with a combination of these libraries? Please help.

Rajesh Gosemath asked May 23 '19 15:05




1 Answer

I am working on a similar business problem. Since invoices don't have a fixed format, you can't directly use any text-parsing method.

To solve this problem, you have to use computer vision (deep learning) for field detection and Pytesseract OCR for converting the image into text. For a better understanding, here are the steps:

  1. Convert the invoices to images and annotate the images with fields like address, amount, etc. using a tool like labelImg. (For better results, use 500-1000 invoices of different types.)

  2. After generating the XML annotation files, train any object detection model, such as YOLO or the TF Object Detection API.

  3. The model will detect the fields and give you the coordinates of each Region Of Interest (ROI).

  4. Apply Pytesseract OCR on the ROI coordinates.

  5. Finally, use regex to validate the text in each extracted field and perform any manipulation/transformation that is necessary. Lastly, store the data in a CSV file or a database.
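
As a rough illustration of steps 4 and 5, here is a minimal Python sketch. It assumes the detector from step 3 has already produced a pixel box `(left, upper, right, lower)` for each field. The field names, the regex patterns in `FIELD_PATTERNS`, and the helper names `ocr_region`, `clean_field` and `rows_to_csv` are all made up for this example; real invoices will need patterns tuned per field.

```python
import csv
import io
import re

# Hypothetical patterns for illustration only; tune them to your own invoices.
FIELD_PATTERNS = {
    "invoice_number": re.compile(r"[A-Z]{0,4}-?\d{3,10}"),
    "date": re.compile(r"\d{1,2}[/-]\d{1,2}[/-]\d{2,4}"),
    "amount": re.compile(r"\d+(?:,\d{3})*(?:\.\d{2})?"),
}


def ocr_region(image_path, box):
    """Step 4: crop box = (left, upper, right, lower) from the image and OCR it."""
    # Imported lazily so the helpers below also work without OCR installed.
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as img:
        return pytesseract.image_to_string(img.crop(box))


def clean_field(name, raw_text):
    """Step 5: return the first substring matching the field's pattern, or None."""
    match = FIELD_PATTERNS[name].search(raw_text)
    return match.group(0) if match else None


def rows_to_csv(rows, fieldnames):
    """Serialize a list of field dicts to CSV text."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

In use, you would call `ocr_region` once per detected box, pass the raw text through `clean_field`, and treat a `None` result as a field that needs manual review instead of writing a bad value to the CSV.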

Hope my answer helps you! Upvote the answer so it reaches more people.

Yashraj Nigam answered Oct 05 '22 23:10