I am working on invoice parser which extracts data from invoices in pdf or image format.It works on simple pdf with non tabular data but gives lots of output data to process with pdf which contains tables.I am not able to get a working generic solution for this.I have tried the following libraries
Invoice2Data : It is based on templates.It has given fairly good results in json format till now.But Template creation for complex pdfs containing dynamic table is complex.
Tabula : Table extraction is based on coordinates of the table to be extracted.If the data in the table increases the table length increases and hence the coordinates changes.So in this case it gives wrong results.
Pdftotext : It converts any pdfs to text but with the format that needs lots of parsing which we do not want.
Aws_Textract and Elis_Rossum_Ai : Gives all the data in json format.But if the table column contains multiple line then json parsing becomes difficult.Even the json given is huge in size to parse.
Tesseract : Same as pdftotext.Complex pdfs are not parseable.
Other than all this or with combination of the above libraries has anyone been able to parse complex pdf data please help.
Whether it be to bill a client or to approve and process an invoice you've received, the PDF format is your best bet. You can build your invoice template directly within your PDF application.
To scrape text from scanned PDFs, ReportMiner offers you OCR functionality that can help you convert images into text formats. Once the image-based PDF is converted to text, you can scrape the text from it similar to text-based PDFs (using extraction templates).
You can extract data from PDF files directly into Excel. First, you'll need to import your PDF file. Once you import the file, use the extract data button to begin the extraction process. You should see several instruction windows that will help you extract the selected data.
I am working on a similar business problem. since invoices don't have fixed format so you can't directly use any text parsing method.
To solve this problem you have to use Computer Vision (Deep Learning) for field detection and Pytesseract OCR for converting image into text. For better understanding here are the steps:
Convert invoices to image and annotate the images with fields like address, Amount etc using tools like labelImg. (For better results use different types of 500-1000 invoices)
After Generating XML files train any object detection model like YOLO or TF object detection API.
The model will detect the fields and gives you coordinates of Region Of Interest(ROI). like
Apply Pytessract OCR on the ROI coordinates. Click Here
Finally, use regex to validate the text in the extracted field and perform any manipulation/transformation that is necessary. At last store data to CSV OR Database.
Hope my answer helps you! Upvote answer so it reaches to maximum people.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With