Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Using Textract for OCR locally

Tags:

I want to extract text from images using Python. (Tessaract lib does not work for me because it requires installation).

I have found boto3 lib and Textract, but I'm having trouble working with it. I'm still new to this. Can you tell me what I need to do in order to run my script correctly.

This is my code:

import cv2
import boto3
import textract


#img = cv2.imread('slika2.jpg') #this is jpg file
with open('slika2.pdf', 'rb') as document:
    img = bytearray(document.read())

textract = boto3.client('textract',region_name='us-west-2')

response = textract.detect_document_text(Document={'Bytes': img}). #gives me error
print(response)

When I run this code, I get:

botocore.exceptions.ClientError: An error occurred (InvalidSignatureException) when calling the DetectDocumentText operation: The request signature we calculated does not match the signature you provided. Check your AWS Secret Access Key and signing method. Consult the service documentation for details.

I have also tried this:

# Document
documentName = "slika2.jpg"

# Read document content
with open(documentName, 'rb') as document:
    imageBytes = bytearray(document.read())

# Amazon Textract client
textract = boto3.client('textract',region_name='us-west-2')

# Call Amazon Textract
response = textract.detect_document_text(Document={'Bytes': imageBytes}) #ERROR

#print(response)

# Print detected text
for item in response["Blocks"]:
    if item["BlockType"] == "LINE":
        print ('\033[94m' +  item["Text"] + '\033[0m')

But I get this error:

botocore.exceptions.ClientError: An error occurred (InvalidSignatureException) when calling the DetectDocumentText operation: The request signature we calculated does not match the signature you provided. Check your AWS Secret Access Key and signing method. Consult the service documentation for details.

Im noob in this, so any help would be good. How can I read text form my image or pdf file?

I have also added this block of code, but the error is still Unable to locate credentials.

session = boto3.Session(
    aws_access_key_id='xxxxxxxxxxxx',
    aws_secret_access_key='yyyyyyyyyyyyyyyyyyyyy'
)
like image 330
taga Avatar asked Sep 24 '20 10:09

taga


1 Answers

There is problem in passing credentials to boto3. You have to pass the credentials while creating boto3 client.

import boto3

# boto3 client
client = boto3.client(
    'textract', 
    region_name='us-west-2', 
    aws_access_key_id='xxxxxxx', 
    aws_secret_access_key='xxxxxxx'
)

# Read image
with open('slika2.png', 'rb') as document:
    img = bytearray(document.read())

# Call Amazon Textract
response = client.detect_document_text(
    Document={'Bytes': img}
)

# Print detected text
for item in response["Blocks"]:
    if item["BlockType"] == "LINE":
        print ('\033[94m' +  item["Text"] + '\033[0m')

Do note, it is not recommended to hardcode AWS Keys in code. Please refer following this document

https://boto3.amazonaws.com/v1/documentation/api/1.9.42/guide/configuration.html

like image 106
Vipin Kumar Avatar answered Oct 16 '22 06:10

Vipin Kumar