I have spent all week attempting this, so this is a bit of a hail mary.
I am attempting to package up Tesseract OCR into AWS Lambda running on Python (I am also using PILLOW for image pre-processing, hence the choice of Python).
I understand how to deploy Python packages onto AWS using virtualenv, however I cannot seem to find a way of deploying the actual Tesseract OCR into the environment (e.g. /env/)
pip install py-tesseract
results in a successful deployment of the python wrapper into /env/, however this relies on a separate (local) install of Tesseractpip install tesseract-ocr
gets me only a certain distance before it errors out as follows which I am assuming is due to a missing leptonica dependency. However, I have no idea how to package up leptonica into /env/ (if that is even possible)tesseract_ocr.cpp:264:10: fatal error: 'leptonica/allheaders.h' file not found #include "leptonica/allheaders.h"
Processing dependencies for python-tesseract==0.9.1 Searching for python-tesseract==0.9.1 Reading https://pypi.python.org/simple/python-tesseract/ Couldn't find index page for 'python-tesseract' (maybe misspelled?) Scanning index of all packages (this may take a while) Reading https://pypi.python.org/simple/ No local packages or download links found for python-tesseract==0.9.1
Any pointers would be greatly appreciated.
Also the code is able to detect text, its just extremely slow. Recognising text from images is very cpu intensive - as a first step I would look at binarizing the input that is passed into image_to_string - this can speed up text recognition significantly.
Pytesseract or Python-tesseract is an OCR tool for python that also serves as a wrapper for the Tesseract-OCR Engine. It can read and recognize text in images and is commonly used in python ocr image to text use cases.
The reason it's not working is because these python packages are only wrappers to tesseract. You have to compile tesseract using a AWS Linux instance and copy the binaries and libraries to the zip file of the lambda function.
1) Start an EC2 instance with 64-bit Amazon Linux;
2) Install dependencies:
sudo yum install gcc gcc-c++ make
sudo yum install autoconf aclocal automake
sudo yum install libtool
sudo yum install libjpeg-devel libpng-devel libpng-devel libtiff-devel zlib-devel
3) Compile and install leptonica:
cd ~
mkdir leptonica
cd leptonica
wget http://www.leptonica.com/source/leptonica-1.73.tar.gz
tar -zxvf leptonica-1.73.tar.gz
cd leptonica-1.73
./configure
make
sudo make install
4) Compile and install tesseract
cd ~
mkdir tesseract
cd tesseract
wget https://github.com/tesseract-ocr/tesseract/archive/3.04.01.tar.gz
tar -zxvf 3.04.01.tar.gz
cd tesseract-3.04.01
./autogen.sh
./configure
make
sudo make install
5) Download language traineddata to tessdata
cd /usr/local/share/tessdata
wget https://github.com/tesseract-ocr/tessdata/raw/3.04.00/eng.traineddata
export TESSDATA_PREFIX=/usr/local/share/
At this point you should be able to use tesseract on this EC2 instance. To copy the binaries of tesseract and use it on a lambda function you will need to copy some files from this instance to the zip file you upload to lambda. I'll post all the commands to get a zip file with all the files you need.
6) Zip all the stuff you need to run tesseract on lambda
cd ~
mkdir tesseract-lambda
cd tesseract-lambda
cp /usr/local/bin/tesseract .
mkdir lib
cd lib
cp /usr/local/lib/libtesseract.so.3 .
cp /usr/local/lib/liblept.so.5 .
cp /usr/lib64/libpng12.so.0 .
cd ..
mkdir tessdata
cd tessdata
cp /usr/local/share/tessdata/eng.traineddata .
cd ..
cd ..
zip -r tesseract-lambda.zip tesseract-lambda
The tesseract-lambda.zip file have everything lambda needs to run tesseract. The last thing to do is add the lambda function at the root of the zip file and upload it to lambda. Here is an example that I have not tested, but should work.
7) Create a file named main.py, write a lambda function like the one above and add it on the root of tesseract-lambda.zip:
from __future__ import print_function
import urllib
import boto3
import os
import subprocess
SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))
LIB_DIR = os.path.join(SCRIPT_DIR, 'lib')
s3 = boto3.client('s3')
def lambda_handler(event, context):
# Get the bucket and object from the event
bucket = event['Records'][0]['s3']['bucket']['name']
key = urllib.unquote_plus(event['Records'][0]['s3']['object']['key']).decode('utf8')
try:
print("Bucket: " + bucket)
print("Key: " + key)
imgfilepath = '/tmp/image.png'
jsonfilepath = '/tmp/result.txt'
exportfile = key + '.txt'
print("Export: " + exportfile)
s3.download_file(bucket, key, imgfilepath)
command = 'LD_LIBRARY_PATH={} TESSDATA_PREFIX={} {}/tesseract {} {}'.format(
LIB_DIR,
SCRIPT_DIR,
SCRIPT_DIR,
imgfilepath,
jsonfilepath,
)
try:
output = subprocess.check_output(command, shell=True)
print(output)
s3.upload_file(jsonfilepath, bucket, exportfile)
except subprocess.CalledProcessError as e:
print(e.output)
except Exception as e:
print(e)
print('Error processing object {} from bucket {}.'.format(key, bucket))
raise e
When creating the AWS Lambda function on the AWS Console, upload the zip file and set the Hanlder to main.lambda_handler. This will tell AWS Lambda to look for the main.py file inside the zip and to call the function lambda_handler.
IMPORTANT
From time to time things change in AWS Lambda's environment. For example, the current image for the lambda env is amzn-ami-hvm-2017.03.1.20170812-x86_64-gp2 (it might not be this one when you read this answer). If tesseract starts to return segmentation fault, run "ldd tesseract" on the Lambda function and see the output for what libs are needed (currently libtesseract.so.3 liblept.so.5 libpng12.so.0).
Thanks for the comment, SergioArcos.
Adapatations for tesseract 4:
Tesseract offers much improvements in version 4, thanks to a neural network. I've tried it with some scans and the improvements are quite substantial. Plus the whole package was 25% smaller in my case. Planned release date of version 4 is first half of 2018.
The build steps are similar to tesseract 3 with some tweaks, that's why I wanted to share them in full. I also made a github repo with ready made binary files (most of it is based on Jose's post above, which was very helpful), plus a blog post how to use it as a processing step after a raspberrypi3 powered scanner step.
To compile the tesseract4 binaries, do these steps on a fresh 64bit AWS AIM instance:
cd ~
sudo yum install clang -y
sudo yum install libpng-devel libtiff-devel zlib-devel libwebp-devel libjpeg-turbo-devel -y
wget https://github.com/DanBloomberg/leptonica/releases/download/1.75.1/leptonica-1.75.1.tar.gz
tar -xzvf leptonica-1.75.1.tar.gz
cd leptonica-1.75.1
./configure && make && sudo make install
Unfortunately, since some weeks tesseract needs autoconf-archive, which is not available for amazon AIMs, so you'd need to compile it on your own:
cd ~
wget http://mirror.switch.ch/ftp/mirror/gnu/autoconf-archive/autoconf-archive-2017.09.28.tar.xz
tar -xvf autoconf-archive-2017.09.28.tar.xz
cd autoconf-archive-2017.09.28
./configure && make && sudo make install
sudo cp m4/* /usr/share/aclocal/
cd ~
sudo yum install git-core libtool pkgconfig -y
git clone --depth 1 https://github.com/tesseract-ocr/tesseract.git tesseract-ocr
cd tesseract-ocr
export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig
./autogen.sh
./configure
make
sudo make install
cd ~
mkdir tesseract-standalone
cd tesseract-standalone
cp /usr/local/bin/tesseract .
mkdir lib
cp /usr/local/lib/libtesseract.so.4 lib/
cp /usr/local/lib/liblept.so.5 lib/
cp /usr/lib64/libjpeg.so.62 lib/
cp /usr/lib64/libwebp.so.4 lib/
cp /usr/lib64/libstdc++.so.6 lib/
mkdir tessdata
cd tessdata
wget https://github.com/tesseract-ocr/tessdata_fast/raw/master/osd.traineddata
wget https://github.com/tesseract-ocr/tessdata_fast/raw/master/eng.traineddata
# additionally any other language you want to use, e.g. `deu` for Deutsch
mkdir configs
cp /usr/local/share/tessdata/configs/pdf configs/
cp /usr/local/share/tessdata/pdf.ttf .
cd ..
zip -r ~/tesseract-standalone.zip *
I have been struggling through this issue for a few days trying to get Tesseract 4 to work on a Python 3.7 Lambda function. Finally I found this article and GitHub which describes how to generate zip files for tesseract, pytesseract, opencv, and pillow using shell scripts that generate the necessary .zip files using Docker images on EC2! This process takes less than 20 minutes using these steps and is reliably reproducible.
Summarized Steps:
Start an Amazon Linux EC2 instance (t2 micro will do just fine)
sudo yum update
sudo yum install git-core -y
sudo yum install docker -y
sudo service docker start
sudo usermod -a -G docker ec2-user #allows ec2-user to call docker
After running the 5th command you will need to logout and log back in for the change to take effect.
git clone https://github.com/amtam0/lambda-tesseract-api.git
cd lambda-tesseract-api/
bash build_tesseract4.sh #takes a few minutes
bash build_py37_pkgs.sh
This will generate .zip files for tesseract, pytesseract, pillow, and opencv. In order to use with lambda you need to complete two more steps.
(Note: you will probably need to increase your Memory allocation and Timeout)
At this point you are all set to upload your code and start using Tesseract on AWS Lambda! Refer back to the Medium article for a test script.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With