How to use Amazon Textract with PDF files

I can already use Textract with JPEG files. I would like to use it with PDF files.

I have the code below:

import boto3

# Document
documentName = "Path to document in JPEG"

# Read document content
with open(documentName, 'rb') as document:
    imageBytes = bytearray(document.read())

# Amazon Textract client
textract = boto3.client('textract')
documentText = ""

# Call Amazon Textract
response = textract.detect_document_text(Document={'Bytes': imageBytes})

#print(response)

# Print detected text
for item in response["Blocks"]:
    if item["BlockType"] == "LINE":
        documentText = documentText + item["Text"]

        # print('\033[94m' +  item["Text"] + '\033[0m')
        # # print(item["Text"])

# removing the quotation marks from the string, otherwise would cause problems to A.I
documentText = documentText.replace(chr(34), '')
documentText = documentText.replace(chr(39), '')
print(documentText)

As I said, it works fine. But I would like to pass it a PDF file, as the test web application allows.

I know it is possible to convert the PDF to JPEG in Python, but it would be nicer to work with the PDF directly. I read the documentation and did not find the answer.

How can I do that?

EDIT 1: I forgot to mention that I do not intend to use an S3 bucket. I want to pass the PDF directly in the script, without having to upload it to an S3 bucket.

493

asked Nov 25 '19 18:11

ArthurS

1 Answer

As @syumaK mentioned, you need to upload the PDF to S3 first. However, doing this may be cheaper and easier than you think.

Create a new S3 bucket in the console, note the bucket name, and then run:

import random
import boto3

bucket = 'YOUR_BUCKETNAME'
path = 'THE_PATH_FROM_WHERE_YOU_UPLOAD_INTO_S3'
filename = 'YOUR_FILENAME'

# Upload the local PDF to S3; the object key is the bare filename
s3 = boto3.resource('s3')
print(f'uploading {filename} to s3')
s3.Bucket(bucket).upload_file(path + filename, filename)

# Start the asynchronous text-detection job
client = boto3.client('textract')
response = client.start_document_text_detection(
    DocumentLocation={'S3Object': {'Bucket': bucket, 'Name': filename}},
    # The token must be a string; 10**10 keeps the upper bound an integer
    ClientRequestToken=str(random.randint(1, 10**10)))

# Fetch the result; JobStatus stays IN_PROGRESS until the job finishes
jobid = response['JobId']
response = client.get_document_text_detection(JobId=jobid)

It may take 5-50 seconds until the call to get_document_text_detection(...) returns a result. Until then, the response only reports that the job is still processing (JobStatus is IN_PROGRESS).
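Because the result is not available immediately, the call usually needs to be repeated until the job finishes. Here is a minimal polling sketch. The helper name wait_for_text_detection and its delay_seconds and max_attempts parameters are my own assumptions, not part of boto3. It treats every status other than IN_PROGRESS and SUCCEEDED (for example FAILED or PARTIAL_SUCCESS) as an error, which may be stricter than you need.

```python
import time


def wait_for_text_detection(client, job_id, delay_seconds=5.0, max_attempts=60):
    """Poll get_document_text_detection until the job leaves IN_PROGRESS.

    `client` is expected to be a boto3 Textract client (or anything exposing
    the same get_document_text_detection method). The helper itself is
    hypothetical and not provided by boto3.
    """
    for _ in range(max_attempts):
        response = client.get_document_text_detection(JobId=job_id)
        status = response["JobStatus"]
        if status == "SUCCEEDED":
            return response
        if status != "IN_PROGRESS":
            # FAILED or PARTIAL_SUCCESS: surface the problem to the caller
            raise RuntimeError(f"Textract job {job_id} ended with status {status}")
        time.sleep(delay_seconds)
    raise TimeoutError(f"Textract job {job_id} still in progress after {max_attempts} attempts")
```

With the variables from the snippet above, this would be called as response = wait_for_text_detection(client, jobid).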

As far as I understand, ClientRequestToken works as an idempotency token: each distinct token results in exactly one paid job, and a request that reuses an earlier token returns that earlier job instead of starting a new one.
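If that understanding is right, a random token gives no protection against paying twice for the same file. One possible alternative, sketched below, is to derive the token from the file contents so that resubmitting an identical PDF reuses the same token. The function name request_token_for is hypothetical. The sketch also assumes that a 64-character hexadecimal string is an acceptable token, which matches my reading of the API reference (1 to 64 characters from letters, digits, hyphens and underscores).

```python
import hashlib


def request_token_for(pdf_bytes: bytes) -> str:
    """Return a deterministic token for the given file contents.

    The SHA-256 hex digest is 64 characters of [0-9a-f], so identical
    files always map to the same token and different files almost
    certainly map to different ones.
    """
    return hashlib.sha256(pdf_bytes).hexdigest()
```

The returned string would be passed as ClientRequestToken in place of the random number.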

Edit: I forgot to mention one intricacy. If the document is large, the result is returned in several parts and needs to be stitched together from multiple 'pages'. The kind of code you will need to add is:


...
pages = [response]
while nextToken := response.get('NextToken'):
    response = client.get_document_text_detection(JobId=jobid, NextToken=nextToken)
    pages.append(response)
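
Once all the 'pages' are collected, the text still has to be pulled out of them. Here is a minimal sketch that mirrors the LINE-block loop from the question. The helper name lines_from_pages is my own, and it assumes each entry in pages is a response dictionary with a "Blocks" list.

```python
def lines_from_pages(pages):
    """Collect the text of every LINE block across all result pages, in order."""
    lines = []
    for page in pages:
        # A page without "Blocks" simply contributes nothing
        for block in page.get("Blocks", []):
            if block["BlockType"] == "LINE":
                lines.append(block["Text"])
    return lines
```

Joining the result, for example with "\n".join(lines_from_pages(pages)), keeps the lines separated instead of running them together.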

169

answered Nov 08 '22 14:11

tyrex

Tags:

amazon-web-services

text-extraction

ocr

amazon-textract
