The AWS Textract page has a "Try" feature where we can upload invoices as PDF, JPEG, etc. But when I uploaded a PDF, it didn't work: no tables were shown, no forms (key-value pairs) were shown, nothing. When I uploaded the same kind of invoice as a JPEG, it worked fine. I don't understand why.
I searched all over the internet but couldn't find a solution. Some people have never even heard of AWS Textract, even though I found it works better than Google Document AI.
Please help!
You can use the amazon-textract-textractor package to simplify calling Amazon Textract and parsing its output. Here is a link to a tutorial on how to use the AnalyzeExpense API: https://aws-samples.github.io/amazon-textract-textractor/notebooks/using_analyze_expense.html
If your PDF is a single page, you can use the synchronous .analyze_expense API like this:
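from textractor import Textractor

# Use the "default" AWS credentials profile
extractor = Textractor(profile_name="default")

# Synchronous call: works for images and single-page PDFs
document = extractor.analyze_expense(
    file_source="invoice.pdf",
    save_image=True,  # keep the page image so the result can be visualized
)
document.visualize(with_words=False)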
If your PDF document has multiple pages, you need to use the asynchronous .start_expense_analysis API. You can do it like this:
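from textractor import Textractor

# Use the "default" AWS credentials profile
extractor = Textractor(profile_name="default")

# Asynchronous call: the file is uploaded to S3 and processed as a job
document = extractor.start_expense_analysis(
    file_source="./multipage_invoice.pdf",
    s3_upload_path="<YOUR S3 BUCKET>",
    s3_output_path="<YOUR S3 BUCKET>",
    save_image=True,  # keep the page images so the result can be visualized
)
document.visualize(with_words=False)[0]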
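If you handle both single-page and multi-page invoices, the two cases above can be wrapped in one small helper. This is only a sketch: analyze_invoice is a hypothetical name, and it assumes you already know the page count (how you obtain it is up to you) and that the same S3 path can serve for both upload and output. It only uses the two Textractor calls shown above.

```python
def analyze_invoice(extractor, file_source, page_count, s3_path=None):
    """Pick the sync or async expense API based on the page count.

    Hypothetical helper: `extractor` is expected to be a Textractor
    instance, `page_count` must be supplied by the caller, and `s3_path`
    is assumed to be usable for both upload and output.
    """
    if page_count < 1:
        raise ValueError("page_count must be at least 1")

    if page_count == 1:
        # Single page: the synchronous API is enough
        return extractor.analyze_expense(
            file_source=file_source,
            save_image=True,
        )

    if s3_path is None:
        raise ValueError("multi-page PDFs need an S3 path for the async API")

    # Multiple pages: go through the asynchronous API and S3
    return extractor.start_expense_analysis(
        file_source=file_source,
        s3_upload_path=s3_path,
        s3_output_path=s3_path,
        save_image=True,
    )
```

You would call it as analyze_invoice(extractor, "invoice.pdf", 1) for a single page, or analyze_invoice(extractor, "./multipage_invoice.pdf", 3, s3_path="<YOUR S3 BUCKET>") for several pages.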