I am currently using Tika to extract text from files uploaded to my Rails app running on AWS Elastic Beanstalk (64bit Amazon Linux 2016.03 v2.1.2 running Ruby 2.2). I'd like to index scanned images as well, so I need to install Tesseract.
I was able to get it to work by installing it from source like so, but it added 10 minutes to my deploys to a fresh instance. Is there a faster way to do this?
.ebextensions/02-tesseract.config
packages:
yum:
autoconf: []
automake: []
libtool: []
libpng-devel: []
libtiff-devel: []
zlib-devel: []
container_commands:
01-command:
command: mkdir -p install
cwd: /home/ec2-user
02-command:
command: cp .ebextensions/scripts/install_tesseract.sh /home/ec2-user/install/
03-command:
command: bash install/install_tesseract.sh
cwd: /home/ec2-user
.ebextensions/scripts/install_tesseract.sh
#!/usr/bin/env bash
cd_to_install () {
cd /home/ec2-user/install
}
cd_to () {
cd /home/ec2-user/install/$1
}
if ! [ -x "$(command -v tesseract)" ]; then
# Add `usr/local/bin` to PATH
echo 'pathmunge /usr/local/bin' > /etc/profile.d/usr_local.sh
chmod +x /etc/profile.d/usr_local.sh
# Install leptonica
cd_to_install
wget http://www.leptonica.org/source/leptonica-1.73.tar.gz
tar -zxvf leptonica-1.73.tar.gz
cd_to leptonica-1.73
./configure
make
make install
rm -rf /home/ec2-user/install/leptonica-1.73.tar.gz
rm -rf /home/ec2-user/install/leptonica-1.73
# Install tesseract ~ the jewel of Odin's treasure room
cd_to_install
wget https://github.com/tesseract-ocr/tesseract/archive/3.04.01.tar.gz
tar -zxvf 3.04.01.tar.gz
cd_to tesseract-3.04.01
./autogen.sh
./configure
make
make install
ldconfig
rm -rf /home/ec2-user/install/3.04.01.tar.gz
rm -rf /home/ec2-user/install/tesseract-3.04.01
# Install tessdata
cd_to_install
wget https://github.com/tesseract-ocr/tessdata/archive/3.04.00.tar.gz
tar -zxvf 3.04.00.tar.gz
cp /home/ec2-user/install/tessdata-3.04.00/eng.* /usr/local/share/tessdata/
rm -rf /home/ec2-user/install/3.04.00.tar.gz
rm -rf /home/ec2-user/install/tessdata-3.04.00
fi
To do this: Download the latest SW (Software Network https://software-network.org/client/ ) client from https://software-network.org/client/ . Checkout tesseract sources git clone https://github.com/tesseract-ocr/tesseract tesseract && cd tesseract . Run sw build .
Short answer
.ebextensions/02-tesseract.config
commands:
01-libwebp:
command: "yum --enablerepo=epel --disablerepo=amzn-main -y install libwebp"
02-tesseract:
command: "yum --enablerepo=epel -y install tesseract"
Long answer
I'm not familiar with non-Ubuntu package managers or ebextensions, so after some digging, I found that there are precompiled binaries that can be installed on Amazon Linux in the stable EPEL repo.
The first obstacle was figuring out how to use the EPEL repo. The easiest way is to use the enablerepo
option on the yum
command.
That gets us here:
yum --enablerepo=epel install tesseract
Next, I had to resolve this dependency error:
[root@ip-10-0-1-193 ec2-user]# yum install --enablerepo=epel tesseract
Loaded plugins: priorities, update-motd, upgrade-helper
951 packages excluded due to repository priority protections
Resolving Dependencies
--> Running transaction check
---> Package tesseract.x86_64 0:3.04.00-3.el6 will be installed
--> Processing Dependency: liblept.so.4()(64bit) for package: tesseract-3.04.00-3.el6.x86_64
--> Running transaction check
---> Package leptonica.x86_64 0:1.72-2.el6 will be installed
--> Processing Dependency: libwebp.so.5()(64bit) for package: leptonica-1.72-2.el6.x86_64
--> Finished Dependency Resolution
Error: Package: leptonica-1.72-2.el6.x86_64 (epel)
Requires: libwebp.so.5()(64bit)
You could try using --skip-broken to work around the problem
You could try running: rpm -Va --nofiles --nodigest
I found the solution here
Just adding the epel repo doesn't solve it, as the packages in the amzn-main repository seem to overrule those in the epel repository. If the libwebp package in the amzn-main repo are excluded it should work
The Tesseract install has some dependencies found in the amzn-main
repo. This is why I first install libwebp
with --disablerepo=amzn-main
.
yum --enablerepo=epel --disablerepo=amzn-main install libwebp
yum --enablerepo=epel install tesseract
Finally, here's how you can install yum packages on Elastic Beanstalk with options:
.ebextensions/02-tesseract.config
commands:
01-libwebp:
command: "yum --enablerepo=epel --disablerepo=amzn-main -y install libwebp"
02-tesseract:
command: "yum --enablerepo=epel -y install tesseract"
Fortunately, this is also the easiest way to install Tesseract on Elastic Beanstalk!
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With