
Unsupported Document format while using Amazon Textract


I am using Amazon Textract with boto3. When I try to parse a PDF file accessed via Amazon S3, it gives me an error: Request has unsupported document format. I am fairly new to this; the Textract documentation mentions that PDF files are indeed supported.

This is the code I am using.

import boto3
textractClient = boto3.client('textract',region_name='us-east-1')
response = textractClient.detect_document_text(
        Document={'S3Object': {'Bucket': 'bucketName', 'Name': 'filename.pdf'}})
blocks = response['Blocks']

This gives me the error: Request has unsupported document format.

Jung Thapa asked Jul 18 '19 07:07




1 Answer

detect_document_text() is a synchronous API that only supports PNG or JPG images.

If you'd like to process PDF files, you should use the asynchronous API called start_document_text_detection().

https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/textract.html#Textract.Client.start_document_text_detection
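A minimal sketch of the asynchronous flow, assuming the same bucket and file names as in the question. The helper name detect_pdf_text, the polling interval, and the max_polls limit are illustrative choices, not part of boto3. It starts a job with start_document_text_detection(), polls get_document_text_detection() until the job is no longer IN_PROGRESS, then follows NextToken to collect every page of results.

```python
import time


def detect_pdf_text(client, bucket, key, poll_seconds=5.0, max_polls=60):
    """Run an asynchronous Textract text-detection job and return all blocks.

    `client` is a boto3 Textract client. The name of this helper and the
    polling defaults are assumptions made for illustration.
    """
    # Start the job; the async API takes DocumentLocation, not Document.
    job = client.start_document_text_detection(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    job_id = job["JobId"]

    # Poll until the job leaves the IN_PROGRESS state.
    for _ in range(max_polls):
        result = client.get_document_text_detection(JobId=job_id)
        status = result["JobStatus"]
        if status != "IN_PROGRESS":
            break
        time.sleep(poll_seconds)
    else:
        raise TimeoutError(f"Textract job {job_id} did not finish in time")

    # Anything other than SUCCEEDED (for example FAILED) is treated as an error here.
    if status != "SUCCEEDED":
        raise RuntimeError(
            f"Textract job {job_id} ended with status {status}: "
            f"{result.get('StatusMessage', '')}"
        )

    # Results are paginated; keep requesting pages while a NextToken is returned.
    blocks = list(result.get("Blocks", []))
    next_token = result.get("NextToken")
    while next_token:
        result = client.get_document_text_detection(
            JobId=job_id, NextToken=next_token
        )
        blocks.extend(result.get("Blocks", []))
        next_token = result.get("NextToken")
    return blocks


def main():
    # boto3 is imported here so the helper above can be exercised without AWS.
    import boto3

    client = boto3.client("textract", region_name="us-east-1")
    blocks = detect_pdf_text(client, "bucketName", "filename.pdf")
    for block in blocks:
        if block["BlockType"] == "LINE":
            print(block["Text"])
```

Calling main() requires AWS credentials and a PDF at the given S3 location. Polling keeps the example short; for longer documents, passing a NotificationChannel to start_document_text_detection() so that completion is published to an SNS topic may be a better fit.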

Julien Simon answered Sep 22 '22 00:09