Tesseract OCR on AWS Lambda via virtualenv

Tags:

python

virtualenv

amazon-web-services

aws-lambda

tesseract

I have spent all week attempting this, so this is a bit of a hail mary.

I am trying to package Tesseract OCR into an AWS Lambda function running on Python (I am also using Pillow for image pre-processing, hence the choice of Python).

I understand how to deploy Python packages to AWS Lambda using virtualenv, but I cannot find a way to deploy the actual Tesseract OCR engine into the environment (e.g. /env/).

Doing pip install py-tesseract successfully deploys the Python wrapper into /env/, but the wrapper relies on a separate (local) install of Tesseract.
Doing pip install tesseract-ocr gets only so far before it errors out as follows, which I assume is due to a missing leptonica dependency. However, I have no idea how to package leptonica into /env/ (if that is even possible).

tesseract_ocr.cpp:264:10: fatal error: 'leptonica/allheaders.h' file not found
#include "leptonica/allheaders.h"

Downloading the python-tesseract 0.9.1 egg file from https://bitbucket.org/3togo/python-tesseract/downloads and running easy_install also errors out when resolving dependencies:

Processing dependencies for python-tesseract==0.9.1
Searching for python-tesseract==0.9.1
Reading https://pypi.python.org/simple/python-tesseract/
Couldn't find index page for 'python-tesseract' (maybe misspelled?)
Scanning index of all packages (this may take a while)
Reading https://pypi.python.org/simple/
No local packages or download links found for python-tesseract==0.9.1

Any pointers would be greatly appreciated.

230

asked Nov 07 '15 21:11

Andy G

3 Answers

It's not working because these Python packages are only wrappers around tesseract. You have to compile tesseract on an Amazon Linux instance and copy the binary and its libraries into the zip file of the Lambda function.
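To make the point about wrappers concrete: whichever wrapper is installed, it needs a native Tesseract executable or shared library to be present on the machine. A minimal sketch using only the standard library (the helper name `find_native_tesseract` is made up for this illustration) reports whether either can be found:

```python
import ctypes.util
import shutil


def find_native_tesseract():
    """Report where (if anywhere) the native Tesseract pieces are installed."""
    return {
        # Full path of the `tesseract` executable on PATH, or None
        "binary": shutil.which("tesseract"),
        # Name of the libtesseract shared library, or None
        "library": ctypes.util.find_library("tesseract"),
    }


if __name__ == "__main__":
    for kind, location in find_native_tesseract().items():
        print("{}: {}".format(kind, location or "not found"))
```

If both entries come back as "not found", a pip-installed wrapper has nothing to call, which is the situation inside a stock Lambda environment.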

1) Start an EC2 instance with 64-bit Amazon Linux;

2) Install dependencies:

sudo yum install gcc gcc-c++ make
# aclocal is part of automake, so it needs no package of its own
sudo yum install autoconf automake
sudo yum install libtool
sudo yum install libjpeg-devel libpng-devel libtiff-devel zlib-devel

3) Compile and install leptonica:

cd ~
mkdir leptonica
cd leptonica
wget http://www.leptonica.com/source/leptonica-1.73.tar.gz
tar -zxvf leptonica-1.73.tar.gz
cd leptonica-1.73
./configure
make
sudo make install

4) Compile and install tesseract

cd ~
mkdir tesseract
cd tesseract
wget https://github.com/tesseract-ocr/tesseract/archive/3.04.01.tar.gz
tar -zxvf 3.04.01.tar.gz
cd tesseract-3.04.01
./autogen.sh
./configure
make
sudo make install

5) Download language traineddata to tessdata

cd /usr/local/share/tessdata
# the directory is owned by root after "sudo make install"
sudo wget https://github.com/tesseract-ocr/tessdata/raw/3.04.00/eng.traineddata
export TESSDATA_PREFIX=/usr/local/share/

At this point you should be able to use tesseract on this EC2 instance. To use it in a Lambda function, you need to copy the binary, its shared libraries and the language data from this instance into the zip file you upload to Lambda. The commands below build a zip file with everything you need.

6) Zip all the stuff you need to run tesseract on lambda

cd ~
mkdir tesseract-lambda
cd tesseract-lambda
cp /usr/local/bin/tesseract .
mkdir lib
cd lib
cp /usr/local/lib/libtesseract.so.3 .
cp /usr/local/lib/liblept.so.5 .
cp /usr/lib64/libpng12.so.0 .
cd ..

mkdir tessdata
cd tessdata
cp /usr/local/share/tessdata/eng.traineddata .
cd ..

# zip from inside the folder so that tesseract, lib/ and tessdata/ end up at the root of the archive
zip -r ~/tesseract-lambda.zip .
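Since the handler in step 7 looks for tesseract, lib/ and tessdata/ right next to main.py, it may be worth confirming that these entries sit at the top level of the archive and not inside a subfolder. The following is a rough sketch, not part of the original recipe; the function name `missing_entries` is hypothetical and the list of names is taken from the copy commands above:

```python
import zipfile

# Entries the handler expects at the top level of the archive
# (names taken from the copy commands above)
REQUIRED = [
    "tesseract",
    "lib/libtesseract.so.3",
    "lib/liblept.so.5",
    "lib/libpng12.so.0",
    "tessdata/eng.traineddata",
]


def missing_entries(zip_path, required=REQUIRED):
    """Return the required entries that are absent from the zip's top level."""
    with zipfile.ZipFile(zip_path) as archive:
        names = set(archive.namelist())
    return [name for name in required if name not in names]
```

Calling `missing_entries` with the path to tesseract-lambda.zip should return an empty list when the layout is right. If every name comes back as missing, the files were most likely zipped together with their parent folder.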

The tesseract-lambda.zip file has everything Lambda needs to run tesseract. The last thing to do is add the Lambda function at the root of the zip file and upload it to Lambda. Here is an example that I have not tested, but it should work.

7) Create a file named main.py, write a Lambda function like the one below, and add it at the root of tesseract-lambda.zip:

import os
import subprocess
from urllib.parse import unquote_plus  # Python 3 runtime

import boto3

# Directory that holds main.py, the tesseract binary, lib/ and tessdata/
SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))
LIB_DIR = os.path.join(SCRIPT_DIR, 'lib')

s3 = boto3.client('s3')

def lambda_handler(event, context):

    # Get the bucket and object from the event (S3 URL-encodes the key)
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = unquote_plus(event['Records'][0]['s3']['object']['key'])

    try:
        print("Bucket: " + bucket)
        print("Key: " + key)

        imgfilepath = '/tmp/image.png'
        # tesseract appends ".txt" to the output base name by itself
        outputbase = '/tmp/result'
        txtfilepath = outputbase + '.txt'
        exportfile = key + '.txt'

        print("Export: " + exportfile)

        s3.download_file(bucket, key, imgfilepath)

        command = 'LD_LIBRARY_PATH={} TESSDATA_PREFIX={} {}/tesseract {} {}'.format(
            LIB_DIR,
            SCRIPT_DIR,
            SCRIPT_DIR,
            imgfilepath,
            outputbase,
        )

        try:
            # Capture stderr as well, since tesseract reports problems there
            output = subprocess.check_output(command, shell=True, stderr=subprocess.STDOUT)
            print(output)
            s3.upload_file(txtfilepath, bucket, exportfile)
        except subprocess.CalledProcessError as e:
            print(e.output)

    except Exception as e:
        print(e)
        print('Error processing object {} from bucket {}.'.format(key, bucket))
        raise
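One variation worth considering is to skip the shell and pass subprocess an argument list plus an explicit environment, so that paths with spaces or special characters need no quoting. The sketch below is only an illustration under the same layout as above (binary, lib/ and tessdata/ next to the handler); the helper names `build_tesseract_call` and `run_tesseract` are made up for this example:

```python
import os
import subprocess


def build_tesseract_call(script_dir, image_path, output_base):
    """Return (args, env) for running the bundled tesseract without a shell."""
    args = [os.path.join(script_dir, "tesseract"), image_path, output_base]
    env = dict(os.environ)
    # Bundled shared libraries live in lib/ next to the binary
    env["LD_LIBRARY_PATH"] = os.path.join(script_dir, "lib")
    # Tesseract 3.x looks for <TESSDATA_PREFIX>/tessdata/
    env["TESSDATA_PREFIX"] = script_dir
    return args, env


def run_tesseract(script_dir, image_path, output_base):
    """Run tesseract and return the path of the text file it writes."""
    args, env = build_tesseract_call(script_dir, image_path, output_base)
    subprocess.check_output(args, env=env, stderr=subprocess.STDOUT)
    # tesseract appends ".txt" to the output base by itself
    return output_base + ".txt"
```

Inside the handler this could be used as `run_tesseract(SCRIPT_DIR, '/tmp/image.png', '/tmp/result')`, with the returned path handed to `s3.upload_file`.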

When creating the AWS Lambda function in the AWS Console, upload the zip file and set the Handler to main.lambda_handler. This tells AWS Lambda to look for the main.py file inside the zip and to call the function lambda_handler.
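For reference, the handler only reads two fields from the S3 event, and S3 delivers the object key URL-encoded, which is why it is decoded before use. The sketch below shows a trimmed-down event (the bucket and key are made-up values) and what decoding does. It also includes a small check that could be used to skip the .txt files the handler itself uploads, because a trigger on every object-created event in the same bucket would fire again for them. The helper names are hypothetical.

```python
from urllib.parse import unquote_plus

# Trimmed-down S3 "object created" event; bucket and key are made-up values
SAMPLE_EVENT = {
    "Records": [
        {"s3": {"bucket": {"name": "my-scans"},
                "object": {"key": "inbox/my+scan%281%29.png"}}}
    ]
}


def extract_target(event):
    """Return (bucket, decoded key) for the first record of an S3 event."""
    record = event["Records"][0]["s3"]
    return record["bucket"]["name"], unquote_plus(record["object"]["key"])


def is_ocr_output(key):
    """True for the .txt files the handler itself uploads."""
    return key.endswith(".txt")


bucket, key = extract_target(SAMPLE_EVENT)
print(bucket, key)          # my-scans inbox/my scan(1).png
print(is_ocr_output(key))   # False
```

An alternative to the `is_ocr_output` check is to restrict the trigger to a prefix or suffix in the S3 notification settings, or to write the results to a different bucket.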

IMPORTANT

From time to time things change in AWS Lambda's environment. For example, the current image for the Lambda environment is amzn-ami-hvm-2017.03.1.20170812-x86_64-gp2 (it might not be this one when you read this answer). If tesseract starts to fail with a segmentation fault, run "ldd tesseract" inside the Lambda environment and check the output to see which libraries are needed (currently libtesseract.so.3, liblept.so.5 and libpng12.so.0).
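The ldd check can also be scripted. The following sketch assumes ldd is available wherever it runs and parses the usual "name => path" lines. The function names and the optional `lib_dir` parameter are made up for this illustration.

```python
import os
import re
import subprocess

# Matches lines such as "libfoo.so.1 => /usr/lib64/libfoo.so.1 (0x...)"
# and "libfoo.so.1 => not found"
LDD_LINE = re.compile(r"^\s*(\S+)\s+=>\s+(not found|\S+)")


def parse_ldd(text):
    """Map each library name in ldd output to its resolved path (None if missing)."""
    libs = {}
    for line in text.splitlines():
        match = LDD_LINE.match(line)
        if match:
            name, target = match.groups()
            libs[name] = None if target == "not found" else target
    return libs


def unresolved_libs(binary_path, lib_dir=None):
    """Run ldd on a binary and return the libraries the loader cannot find."""
    env = dict(os.environ)
    if lib_dir:
        # Let the loader also search the bundled lib/ folder
        env["LD_LIBRARY_PATH"] = lib_dir
    output = subprocess.check_output(["ldd", binary_path], env=env).decode()
    return sorted(name for name, path in parse_ldd(output).items() if path is None)
```

Any name returned by `unresolved_libs` is a library that would have to be copied into lib/ from the build instance.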

Thanks for the comment, SergioArcos.

160

answered Oct 22 '22 06:10

José Augusto Paiva

Adaptations for tesseract 4:

Tesseract offers substantial improvements in version 4, thanks to a neural network. I've tried it with some scans and the results are noticeably better. Plus the whole package was 25% smaller in my case. The planned release date of version 4 is the first half of 2018.

The build steps are similar to those for tesseract 3, with some tweaks, which is why I wanted to share them in full. I also made a GitHub repo with ready-made binary files (most of it is based on Jose's post above, which was very helpful), plus a blog post on how to use it as a processing step after a Raspberry Pi 3 powered scanner.

To compile the tesseract 4 binaries, do these steps on a fresh 64-bit Amazon Linux AMI instance:

Compile leptonica

cd ~
sudo yum install clang -y
sudo yum install libpng-devel libtiff-devel zlib-devel libwebp-devel libjpeg-turbo-devel -y
wget https://github.com/DanBloomberg/leptonica/releases/download/1.75.1/leptonica-1.75.1.tar.gz
tar -xzvf leptonica-1.75.1.tar.gz
cd leptonica-1.75.1
./configure && make && sudo make install

Compile autoconf-archive

Unfortunately, for some weeks now tesseract has needed autoconf-archive, which is not available as a package for Amazon Linux AMIs, so you need to compile it yourself:

cd ~
wget http://mirror.switch.ch/ftp/mirror/gnu/autoconf-archive/autoconf-archive-2017.09.28.tar.xz
tar -xvf autoconf-archive-2017.09.28.tar.xz
cd autoconf-archive-2017.09.28
./configure && make && sudo make install
sudo cp m4/* /usr/share/aclocal/

Compile tesseract

cd ~
sudo yum install git-core libtool pkgconfig -y
git clone --depth 1  https://github.com/tesseract-ocr/tesseract.git tesseract-ocr
cd tesseract-ocr
export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig
./autogen.sh
./configure
make
sudo make install

Get all needed files and zip

cd ~
mkdir tesseract-standalone
cd tesseract-standalone
cp /usr/local/bin/tesseract .
mkdir lib
cp /usr/local/lib/libtesseract.so.4 lib/
cp /usr/local/lib/liblept.so.5 lib/
cp /usr/lib64/libjpeg.so.62 lib/
cp /usr/lib64/libwebp.so.4 lib/
cp /usr/lib64/libstdc++.so.6 lib/
mkdir tessdata
cd tessdata
wget https://github.com/tesseract-ocr/tessdata_fast/raw/master/osd.traineddata
wget https://github.com/tesseract-ocr/tessdata_fast/raw/master/eng.traineddata
# additionally any other language you want to use, e.g. `deu` for Deutsch
mkdir configs
cp /usr/local/share/tessdata/configs/pdf configs/
cp /usr/local/share/tessdata/pdf.ttf .
cd ..
zip -r ~/tesseract-standalone.zip *
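Because the archive also ships the pdf config and pdf.ttf, the standalone binary can presumably produce searchable PDFs. The sketch below shows how such a call might be assembled in Python. `--tessdata-dir` and `-l` are existing tesseract command-line options and `pdf` is the config file copied above, but I have not verified this exact invocation against a 4.x build, and the helper name `build_pdf_ocr_call` is made up.

```python
import os


def build_pdf_ocr_call(base_dir, image_path, output_base, lang="eng"):
    """Return (args, env) that would OCR an image into a searchable PDF."""
    args = [
        os.path.join(base_dir, "tesseract"),
        image_path,
        output_base,                      # tesseract appends ".pdf" itself
        "--tessdata-dir", os.path.join(base_dir, "tessdata"),
        "-l", lang,
        "pdf",                            # config file copied into tessdata/configs/
    ]
    # Bundled shared libraries live in lib/ next to the binary
    env = dict(os.environ, LD_LIBRARY_PATH=os.path.join(base_dir, "lib"))
    return args, env
```

The returned pair can be passed to `subprocess.check_output(args, env=env)`, where `base_dir` is the folder into which tesseract-standalone.zip was unpacked.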

answered Oct 22 '22 05:10

hansaplast

Generate zip files using shell scripts to compile code Tesseract 4 for Python 3.7

I struggled for a few days to get Tesseract 4 working in a Python 3.7 Lambda function. Finally I found this article and GitHub repo, which describe how to generate zip files for tesseract, pytesseract, opencv, and pillow using shell scripts that run the builds in Docker images on EC2. The process takes less than 20 minutes with these steps and is reliably reproducible.

Summarized Steps:

Start an Amazon Linux EC2 instance (a t2.micro will do just fine)

sudo yum update
sudo yum install git-core -y
sudo yum install docker -y
sudo service docker start
sudo usermod -a -G docker ec2-user #allows ec2-user to call docker

After running the last command (usermod), you will need to log out and log back in for the group change to take effect.

git clone https://github.com/amtam0/lambda-tesseract-api.git
cd lambda-tesseract-api/
bash build_tesseract4.sh #takes a few minutes
bash build_py37_pkgs.sh

This will generate .zip files for tesseract, pytesseract, pillow, and opencv. In order to use with lambda you need to complete two more steps.

Create Lambda layers, one for each zip file, and attach the layers to your Lambda function.
Create an environment variable on the function with key PYTHONPATH and value /opt/.
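Lambda unpacks every attached layer under /opt, which is why PYTHONPATH points there. Where exactly the tesseract binary ends up inside /opt depends on how the zip was built. I have not verified the layout these scripts produce, so the sketch below simply searches for it (the helper name `find_in_layers` is made up):

```python
import os


def find_in_layers(filename, root="/opt"):
    """Return the first path under `root` whose basename is `filename`, else None."""
    for dirpath, _dirnames, filenames in os.walk(root):
        if filename in filenames:
            return os.path.join(dirpath, filename)
    return None


if __name__ == "__main__":
    # On Lambda this prints where the layer put the binary; elsewhere, usually None
    print(find_in_layers("tesseract"))
```

If pytesseract cannot locate the executable on its own, the path found this way can be assigned to `pytesseract.pytesseract.tesseract_cmd` before calling it.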

(Note: you will probably need to increase your Memory allocation and Timeout)

At this point you are all set to upload your code and start using Tesseract on AWS Lambda! Refer back to the Medium article for a test script.

answered Oct 22 '22 04:10

Alex Albracht
