
AWS Textract not working with Invoices in PDFs. Any advice?

The AWS Textract console has a "Try" feature where you can upload invoices as PDF, JPEG, etc. When I uploaded a PDF, it didn't work: no tables were shown, no forms (key-value pairs) were shown, nothing. When I uploaded an invoice as a JPEG, it worked fine. I don't understand why.

I searched all over the internet but couldn't find a solution. Some people have never even heard of AWS Textract, even though I found it to be better than Google Document AI.

Please help!

Tushar asked Sep 05 '25 03:09



1 Answer

You can use the amazon-textract-textractor package to simplify calling Amazon Textract and parsing its output. There is a tutorial on using the AnalyzeExpense API here: https://aws-samples.github.io/amazon-textract-textractor/notebooks/using_analyze_expense.html

If your PDF is a single page, you can use the synchronous .analyze_expense API like this:

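from textractor import Textractor

# Uses the AWS credentials stored under the "default" profile
extractor = Textractor(profile_name="default")

# Synchronous call: suitable for a single-page document
document = extractor.analyze_expense(
    file_source="invoice.pdf",
    save_image=True,  # keep the page image so it can be visualized
)
# Draw the detected expense fields on the page, without individual words
document.visualize(with_words=False)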
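If you want the extracted values as plain data rather than a rendered image, one option is to read the raw AnalyzeExpense response yourself. The following is a minimal sketch, assuming you call Textract through boto3 instead of textractor; `summarize_expense` and `analyze_invoice` are hypothetical helper names, and the invoice path is whatever file you pass in. It relies on the documented response layout (`ExpenseDocuments`, `SummaryFields`, `LineItemGroups`).

```python
from typing import Any, Dict, List


def summarize_expense(response: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Flatten an AnalyzeExpense response into one dict per expense document."""
    results = []
    for doc in response.get("ExpenseDocuments", []):
        summary: Dict[str, List[str]] = {}
        for field in doc.get("SummaryFields", []):
            # Normalized type such as VENDOR_NAME or TOTAL
            field_type = field.get("Type", {}).get("Text", "OTHER")
            value = field.get("ValueDetection", {}).get("Text", "")
            # A type can occur more than once, so collect values in a list
            summary.setdefault(field_type, []).append(value)

        line_items = []
        for group in doc.get("LineItemGroups", []):
            for item in group.get("LineItems", []):
                row = {}
                for cell in item.get("LineItemExpenseFields", []):
                    cell_type = cell.get("Type", {}).get("Text", "OTHER")
                    row[cell_type] = cell.get("ValueDetection", {}).get("Text", "")
                line_items.append(row)

        results.append({"summary": summary, "line_items": line_items})
    return results


def analyze_invoice(path: str) -> List[Dict[str, Any]]:
    """Hypothetical wrapper: send a local single-page file to the sync API."""
    import boto3  # imported here so the parser above works without AWS set up

    client = boto3.client("textract")
    with open(path, "rb") as handle:
        response = client.analyze_expense(Document={"Bytes": handle.read()})
    return summarize_expense(response)
```

Calling `analyze_invoice("invoice.pdf")` would return a list with one entry per detected expense document, each holding a `summary` dict and a list of `line_items` rows.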


If your PDF has multiple pages, you need to use the asynchronous .start_expense_analysis API instead. You can do it like this:

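from textractor import Textractor

# Uses the AWS credentials stored under the "default" profile
extractor = Textractor(profile_name="default")

# Asynchronous call: required for multi-page documents
document = extractor.start_expense_analysis(
    file_source="./multipage_invoice.pdf",
    s3_upload_path="<YOUR S3 BUCKET>",  # where the local file is uploaded
    s3_output_path="<YOUR S3 BUCKET>",  # where Textract writes the job output
    save_image=True,  # keep the page images so they can be visualized
)
# visualize() returns one image per page here; [0] selects the first page
document.visualize(with_words=False)[0]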
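For context, the asynchronous flow is a start call followed by polling for the result. Below is a rough sketch of that polling loop, assuming boto3 rather than textractor; `collect_expense_documents` and `start_and_collect` are hypothetical helper names, the bucket and key are placeholders you supply, and the timing defaults are arbitrary. It treats any status other than `IN_PROGRESS` or `FAILED` as usable output, which is a simplification.

```python
import time
from typing import Any, Dict, List


def collect_expense_documents(
    client: Any,
    job_id: str,
    poll_seconds: float = 5.0,
    max_polls: int = 120,
) -> List[Dict[str, Any]]:
    """Poll an expense-analysis job, then gather every page of results."""
    for _ in range(max_polls):
        page = client.get_expense_analysis(JobId=job_id)
        if page["JobStatus"] != "IN_PROGRESS":
            break
        time.sleep(poll_seconds)
    else:
        raise TimeoutError(f"Job {job_id} did not finish after {max_polls} polls")

    if page["JobStatus"] == "FAILED":
        raise RuntimeError(page.get("StatusMessage", "Textract job failed"))

    documents = list(page.get("ExpenseDocuments", []))
    # Results can be paginated; follow NextToken until it is absent
    while page.get("NextToken"):
        page = client.get_expense_analysis(JobId=job_id, NextToken=page["NextToken"])
        documents.extend(page.get("ExpenseDocuments", []))
    return documents


def start_and_collect(bucket: str, key: str) -> List[Dict[str, Any]]:
    """Hypothetical wrapper: start a job on a PDF already stored in S3."""
    import boto3  # imported here so the polling helper can be tested offline

    client = boto3.client("textract")
    job = client.start_expense_analysis(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    return collect_expense_documents(client, job["JobId"])
```

Fixed-interval polling keeps the sketch short; for production use, an SNS notification channel or a backoff between polls would likely be a better fit.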
Thomas answered Sep 07 '25 22:09



