
Fastest way to install Tesseract on Elastic Beanstalk

I am currently using Tika to extract text from files uploaded to my Rails app running on AWS Elastic Beanstalk (64bit Amazon Linux 2016.03 v2.1.2 running Ruby 2.2). I'd like to index scanned images as well, so I need to install Tesseract.

I got it working by building it from source as shown below, but that adds 10 minutes to every deploy to a fresh instance. Is there a faster way to do this?

.ebextensions/02-tesseract.config

packages:
  yum:
    autoconf: []
    automake: []
    libtool: []
    libpng-devel: []
    libtiff-devel: []
    zlib-devel: []

container_commands:
  01-command:
    command: mkdir -p install
    cwd: /home/ec2-user
  02-command:
    command: cp .ebextensions/scripts/install_tesseract.sh /home/ec2-user/install/
  03-command:
    command: bash install/install_tesseract.sh
    cwd: /home/ec2-user

.ebextensions/scripts/install_tesseract.sh

#!/usr/bin/env bash

cd_to_install () {
  cd /home/ec2-user/install
}

cd_to () {
  cd "/home/ec2-user/install/$1"
}

if ! [ -x "$(command -v tesseract)" ]; then
  # Add `/usr/local/bin` to PATH
  echo 'pathmunge /usr/local/bin' > /etc/profile.d/usr_local.sh
  chmod +x /etc/profile.d/usr_local.sh

  # Install leptonica
  cd_to_install
  wget http://www.leptonica.org/source/leptonica-1.73.tar.gz
  tar -zxvf leptonica-1.73.tar.gz
  cd_to leptonica-1.73
  ./configure
  make
  make install
  rm -rf /home/ec2-user/install/leptonica-1.73.tar.gz
  rm -rf /home/ec2-user/install/leptonica-1.73

  # Install tesseract ~ the jewel of Odin's treasure room
  cd_to_install
  wget https://github.com/tesseract-ocr/tesseract/archive/3.04.01.tar.gz
  tar -zxvf 3.04.01.tar.gz
  cd_to tesseract-3.04.01
  ./autogen.sh
  ./configure
  make
  make install
  ldconfig
  rm -rf /home/ec2-user/install/3.04.01.tar.gz
  rm -rf /home/ec2-user/install/tesseract-3.04.01

  # Install tessdata
  cd_to_install
  wget https://github.com/tesseract-ocr/tessdata/archive/3.04.00.tar.gz
  tar -zxvf 3.04.00.tar.gz
  cp /home/ec2-user/install/tessdata-3.04.00/eng.* /usr/local/share/tessdata/
  rm -rf /home/ec2-user/install/3.04.00.tar.gz
  rm -rf /home/ec2-user/install/tessdata-3.04.00
fi
asked Jun 28 '16 02:06 by monozok



1 Answer

Short answer

.ebextensions/02-tesseract.config

commands:
  01-libwebp:
    command: "yum --enablerepo=epel --disablerepo=amzn-main -y install libwebp"
  02-tesseract:
    command: "yum --enablerepo=epel -y install tesseract"

Long answer

I'm not familiar with non-Ubuntu package managers or ebextensions, so this took some digging. It turns out that precompiled Tesseract binaries that install on Amazon Linux are available in the stable EPEL repo.

The first obstacle was figuring out how to use the EPEL repo. The easiest way is to pass the --enablerepo option to the yum command.

That gets us here:

yum --enablerepo=epel install tesseract

Next, I had to resolve this dependency error:

[root@ip-10-0-1-193 ec2-user]# yum install --enablerepo=epel tesseract
Loaded plugins: priorities, update-motd, upgrade-helper
951 packages excluded due to repository priority protections
Resolving Dependencies
--> Running transaction check
---> Package tesseract.x86_64 0:3.04.00-3.el6 will be installed
--> Processing Dependency: liblept.so.4()(64bit) for package: tesseract-3.04.00-3.el6.x86_64
--> Running transaction check
---> Package leptonica.x86_64 0:1.72-2.el6 will be installed
--> Processing Dependency: libwebp.so.5()(64bit) for package: leptonica-1.72-2.el6.x86_64
--> Finished Dependency Resolution
Error: Package: leptonica-1.72-2.el6.x86_64 (epel)
           Requires: libwebp.so.5()(64bit)
 You could try using --skip-broken to work around the problem
 You could try running: rpm -Va --nofiles --nodigest

I found the solution here

"Just adding the epel repo doesn't solve it, as the packages in the amzn-main repository seem to overrule those in the epel repository. If the libwebp package in the amzn-main repo is excluded it should work."

Tesseract itself still needs other dependencies that are found in the amzn-main repo, so that repo can't be disabled for the whole install. This is why I first install libwebp on its own with --disablerepo=amzn-main, and only then install Tesseract with both repos available.

yum --enablerepo=epel --disablerepo=amzn-main install libwebp
yum --enablerepo=epel install tesseract
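
If you'd rather run these from a script than type them by hand, the two commands above can be wrapped in a small function that skips the work when Tesseract is already present. This is only a sketch: the function name `install_tesseract` is made up for illustration, and it assumes the script runs as root on an instance where the epel repo is defined.

```shell
# Sketch: idempotent wrapper around the two yum commands above.
# `install_tesseract` is a hypothetical name, not part of any tool.
# Assumes root privileges and that the epel repo is defined on the instance.
install_tesseract() {
  # Nothing to do if a tesseract command is already available.
  if command -v tesseract >/dev/null 2>&1; then
    echo "tesseract already installed"
    return 0
  fi

  # libwebp has to come from EPEL, so hide amzn-main for this one package.
  yum --enablerepo=epel --disablerepo=amzn-main -y install libwebp || return 1

  # The remaining dependencies may come from either repo.
  yum --enablerepo=epel -y install tesseract || return 1

  echo "tesseract installed"
}
```

Calling `install_tesseract` at the end of the script performs the install; a non-zero return means one of the yum steps failed.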

Finally, here's how you can install yum packages on Elastic Beanstalk with options:

.ebextensions/02-tesseract.config

commands:
  01-libwebp:
    command: "yum --enablerepo=epel --disablerepo=amzn-main -y install libwebp"
  02-tesseract:
    command: "yum --enablerepo=epel -y install tesseract"

Fortunately, this is also the easiest way to install Tesseract on Elastic Beanstalk!
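
After a deploy it can be worth confirming over SSH that the binary works and that the English language data is present, since the source build in the question copied `eng.*` by hand. I haven't verified which language files the EPEL package ships, so treat this as a rough check. The helper name `has_tesseract_lang` is made up; it relies only on `tesseract --list-langs`.

```shell
# Sketch: check whether a language code is listed by `tesseract --list-langs`.
# `has_tesseract_lang` is a hypothetical helper name.
has_tesseract_lang() {
  # Some Tesseract versions print the list on stderr, so merge both streams.
  # grep -x matches whole lines only, so "eng" does not match a header line.
  tesseract --list-langs 2>&1 | grep -qx -- "$1"
}
```

For example, `has_tesseract_lang eng || echo "English data missing" >&2` prints a warning when `eng` is not in the list.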

answered Oct 02 '22 01:10 by monozok