
How to use Amazon Textract with PDF files

I can already use Textract with JPEG files. I would like to use it with PDF files.

I have the code below:

import boto3

# Document
documentName = "Path to document in JPEG"

# Read document content
with open(documentName, 'rb') as document:
    imageBytes = bytearray(document.read())

# Amazon Textract client
textract = boto3.client('textract')
documentText = ""

# Call Amazon Textract
response = textract.detect_document_text(Document={'Bytes': imageBytes})

#print(response)

# Print detected text
for item in response["Blocks"]:
    if item["BlockType"] == "LINE":
        documentText = documentText + item["Text"]

        # print('\033[94m' +  item["Text"] + '\033[0m')
        # # print(item["Text"])

# Remove quotation marks from the string; otherwise they cause problems for the A.I.
documentText = documentText.replace(chr(34), '')
documentText = documentText.replace(chr(39), '')
print(documentText)

As I said, it works fine. But I would like to pass it a PDF file, as the web application for testing allows.

I know it is possible to convert the PDF to JPEG in Python, but it would be nice to work with the PDF directly. I read the documentation and did not find the answer.

How can I do that?

EDIT 1: I forgot to mention that I do not intend to use an S3 bucket. I want to pass the PDF directly in the script, without having to upload it to S3.

asked Nov 25 '19 18:11 by ArthurS




1 Answer

As @syumaK mentioned, you need to upload the PDF to S3 first. However, doing this may be cheaper and easier than you think:

  • Create a new S3 bucket in the console and note the bucket name, then run:
import random
import boto3

bucket = 'YOUR_BUCKETNAME'
path = 'THE_PATH_FROM_WHERE_YOU_UPLOAD_INTO_S3'
filename = 'YOUR_FILENAME'

s3 = boto3.resource('s3')
print(f'uploading {filename} to s3')
s3.Bucket(bucket).upload_file(path+filename, filename)

client = boto3.client('textract')
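response = client.start_document_text_detection(
    DocumentLocation={'S3Object': {'Bucket': bucket, 'Name': filename}},
    # The token must be a string, and randint() needs integer bounds
    ClientRequestToken=str(random.randint(1, 10**10)))

# The job runs asynchronously: an immediate call reports JobStatus 'IN_PROGRESS'
jobid = response['JobId']
response = client.get_document_text_detection(JobId=jobid)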

The job may take 5-50 seconds to finish. Until then, get_document_text_detection(...) returns a response whose JobStatus is IN_PROGRESS rather than the detected text.
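Because the result is not ready immediately, the call usually has to be repeated until the job finishes. A minimal polling sketch follows; the helper name wait_for_text_detection, the 5-second delay, and the attempt limit are illustrative assumptions, not part of boto3.

```python
import time


def wait_for_text_detection(client, job_id, delay=5.0, max_attempts=24):
    """Poll until the Textract job leaves the IN_PROGRESS state.

    `client` is a boto3 Textract client (or any object exposing the same
    get_document_text_detection method). Returns the first response whose
    status is no longer IN_PROGRESS.
    """
    for _ in range(max_attempts):
        response = client.get_document_text_detection(JobId=job_id)
        status = response["JobStatus"]
        if status == "FAILED":
            raise RuntimeError(response.get("StatusMessage", "Textract job failed"))
        if status != "IN_PROGRESS":
            return response
        time.sleep(delay)  # wait before asking again
    raise TimeoutError(f"job {job_id} still running after {max_attempts} polls")
```

With the variables from the snippet above, this would be called as response = wait_for_text_detection(client, jobid).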

As I understand it, ClientRequestToken makes the request idempotent: each distinct token triggers exactly one paid API call, and reusing a token returns the earlier job instead of starting a new one.
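If the token really does behave this way, a random token gives no protection against paying twice for the same file. One option is to derive the token from the file contents, so re-running the script on an unchanged PDF reuses the same token. This is only a sketch: the helper name request_token_for_file is hypothetical, and it assumes a 64-character hexadecimal string is an acceptable ClientRequestToken.

```python
import hashlib


def request_token_for_file(path):
    """Return a deterministic 64-character hex token derived from the file's bytes."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large PDFs are not loaded into memory at once
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The result could then be passed as ClientRequestToken=request_token_for_file(path + filename) in place of the random number.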

Edit: I forgot to mention one intricacy. If the document is large, the result may be split across multiple 'pages' that need to be stitched together. The kind of code you will need to add is:


...
pages = [response]
while nextToken := response.get('NextToken'):
    response = client.get_document_text_detection(JobId=jobid, NextToken=nextToken)
    pages.append(response)
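
Once the pages list is complete, the text can be pulled out of it in the same way the question does for a single response. A small sketch, assuming each page is a response dictionary with a "Blocks" list; the helper name lines_from_pages is made up for illustration.

```python
def lines_from_pages(pages):
    """Collect the text of every LINE block across paginated Textract responses."""
    return [
        block["Text"]
        for page in pages
        for block in page.get("Blocks", [])
        if block["BlockType"] == "LINE"
    ]
```

For example, print("\n".join(lines_from_pages(pages))) prints one detected line per row.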
    
answered Nov 08 '22 14:11 by tyrex