tesseract-ocr works on EC2, not lambda

My goal is to run tesseract-ocr in AWS Lambda.

I've built an EC2 instance that attempts to mirror the Lambda environment. Executing tesseract without parameters succeeds in both environments. However, any attempt at substantive image processing, e.g. this code:

tess = child_process.exec('tesseract input.tif output -l eng -psm 1 hocr', function(error, stdout, stderr) {
...

runs successfully on my EC2 box, but fails in Lambda with this error:

Error: Command failed: Tesseract Open Source OCR Engine v3.04.00 with Leptonica
Error during processing.

 at ChildProcess.exithandler (child_process.js:648:15)
 at ChildProcess.emit (events.js:98:17)
 at maybeClose (child_process.js:756:16)
 at Process.ChildProcess._handle.onexit (child_process.js:823:5)
Error code: 1
Signal received: null

Lambda is assuming an IAM role with administrative privileges ({ "Effect": "Allow", "Action": "*", "Resource": "*" }).

The "Error during processing" message is emitted by tesseract as a top-level catch-all, so it does not identify the underlying failure. I'm going to instrument tesseract and try to narrow the problem further.

How I got here:

  • My EC2 machine is a t2.micro running Amazon Linux in us-east-1 (amzn-ami-hvm-2014.09.2.x86_64-ebs (ami-146e2a7c)).
  • I installed node 0.10.33 and [email protected], which match the Lambda versions.
  • I compiled tesseract and leptonica from source, added an rpath, and ran ldd to confirm that all dependencies are found.
  • The tesseract binaries and liblept.so are all in my root directory (/var/task).

I'd like to know what's going wrong - or how to diagnose it.

Thank you, Dave

user1144380 asked Oct 20 '22 15:10


1 Answer

Short answer: output must go in the /tmp dir, e.g.

tesseract input.tif /tmp/output -l eng -psm 1 hocr
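
From Node, the same fix might look roughly like the sketch below. It is not my actual handler: buildTesseractArgs and runTesseract are hypothetical helper names, the flags are copied from the command above, and it assumes the tesseract binary can be found on the PATH of the process. The only real change is that the output base points at /tmp, and stderr is attached to the error so a failure is easier to read.

```javascript
var childProcess = require('child_process');

// Hypothetical helper: build the argument list from the command above,
// pointing the output base at a writable directory (/tmp in Lambda).
function buildTesseractArgs(inputPath, outputDir) {
  var outputBase = outputDir + '/output';
  return [inputPath, outputBase, '-l', 'eng', '-psm', '1', 'hocr'];
}

// Hypothetical helper: run tesseract and pass stderr back to the caller,
// because the exit code alone does not say what went wrong.
// Assumes a binary named "tesseract" is resolvable via PATH.
function runTesseract(inputPath, outputDir, callback) {
  var args = buildTesseractArgs(inputPath, outputDir);
  childProcess.execFile('tesseract', args, function (error, stdout, stderr) {
    if (error) {
      error.message += '\nstderr: ' + stderr;
      return callback(error);
    }
    // args[1] is the output base that tesseract was asked to write to.
    callback(null, args[1]);
  });
}
```

A call such as runTesseract('input.tif', '/tmp', done) would then correspond to the command line shown above.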

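Slightly longer answer: tesseract opens its output file with fopen in "wb" mode under the hood, and apparently writing is forbidden in /var/task. In Lambda the deployment directory is read-only, while /tmp is writable.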
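
One way to confirm this kind of problem without instrumenting tesseract is to probe a directory for writability from inside the function. The sketch below is an illustration only; canWriteTo is a hypothetical helper name, and it simply tries to create and remove a small probe file.

```javascript
var fs = require('fs');
var path = require('path');

// Hypothetical helper: returns true if a file can be created in `dir`.
// It opens a probe file for writing, then closes and deletes it.
function canWriteTo(dir) {
  var probe = path.join(dir, '.write-probe-' + process.pid);
  try {
    fs.closeSync(fs.openSync(probe, 'w'));
    fs.unlinkSync(probe);
    return true;
  } catch (err) {
    // Any failure (read-only filesystem, missing directory, permissions)
    // is treated as "not writable".
    return false;
  }
}
```

Logging canWriteTo('/tmp') and canWriteTo('/var/task') at the start of the handler should show which of the two locations accepts output files.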

I would have noticed this a few days ago, but Lambda had not been propagating my deployment packages. The one time I tried putting the output in the /tmp dir, it had no effect, but that was because Lambda was executing a stale version of my function. The solution is to always call delete-function before calling update-function.
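
A cheap way to spot a stale deployment is to log a fingerprint of the code that is actually running and compare it with the local file. This is only a sketch of that idea; fingerprintFile is a hypothetical helper name, and SHA-256 is just one reasonable choice of hash.

```javascript
var crypto = require('crypto');
var fs = require('fs');

// Hypothetical helper: returns the SHA-256 hex digest of a file's contents.
function fingerprintFile(filePath) {
  return crypto
    .createHash('sha256')
    .update(fs.readFileSync(filePath))
    .digest('hex');
}
```

Calling console.log(fingerprintFile(__filename)) at the top of the handler module, and running the same function locally on the file that was packaged, makes it visible when the two differ.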

user1144380 answered Oct 28 '22 14:10