I am using this repository to deploy tesseract as a lambda layer: https://github.com/bweigel/aws-lambda-tesseract-layer
The deployment works well, and other pytesseract functions such as image_to_string and image_to_data also work without any hiccups.
But when I try to use image_to_pdf_or_hocr like this:
pdf = pytesseract.image_to_pdf_or_hocr(f'/tmp/{file_name}/{page.number}.png', extension='pdf')
it fails with the following error:
Traceback (most recent call last):
  File "/var/task/helpers/ocr_helper.py", line 36, in save_searchable_pdf
    f'/tmp/{file_name}/{page.number}.png', extension='pdf')
  File "/var/task/pytesseract/pytesseract.py", line 432, in image_to_pdf_or_hocr
    return run_and_get_output(*args)
  File "/var/task/pytesseract/pytesseract.py", line 289, in run_and_get_output
    with open(filename, 'rb') as output_file:
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tess_6_hu78b0.pdf'
The file tess_6_hu78b0.pdf does not exist. What does this mean? I have no file named tess_6_hu78b0 to begin with. The path passed to image_to_pdf_or_hocr is 100% correct and the image is present there. I have confirmed this, and the same code works on my local machine.
I have tried:
I read somewhere that I needed to install libtesseract-dev too, so I modified my Dockerfile as follows:
FROM lambci/lambda:build-python3.6
RUN sudo apt install tesseract-ocr
RUN sudo apt install libtesseract-dev
but unfortunately this too did not work.
After 18 hours of hard work, I was finally able to figure it out.
It turns out that https://github.com/bweigel/aws-lambda-tesseract-layer is not bundled with all the files that pytesseract.image_to_pdf_or_hocr() needs to run.
So I built leptonica and tesseract from source to generate the required files.
These required files are available here: https://github.com/prameshbajra/tessdata
Inside https://github.com/bweigel/aws-lambda-tesseract-layer, under the ready-to-use folder, there is a directory named amazonlinux-1, and inside it a folder named tesseract/share/tessdata. All you need to do is copy the required files from the tessdata repository linked above into this directory.
In short: download the tessdata repo and replace the layer's tessdata folder with it.
Note: This tessdata was built with tesseract 4.1.1.
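Before redeploying the layer, it can help to check that the tessdata folder really contains what PDF output needs. The sketch below is only an illustration: the helper name missing_tessdata_files is hypothetical, and the file names in ASSUMED_REQUIRED are my assumption (the font and config files a stock Tesseract 4 install usually keeps under tessdata for PDF and hOCR output), not a list taken from this answer. Adjust them to match the files in the tessdata repo.

```python
from pathlib import Path

# ASSUMPTION: these are the files a stock Tesseract 4 install normally ships
# under tessdata for PDF/hOCR output. The answer does not list the exact
# files, so edit this tuple to match what the tessdata repo contains.
ASSUMED_REQUIRED = ("pdf.ttf", "configs/pdf", "configs/hocr")


def missing_tessdata_files(tessdata_dir, required=ASSUMED_REQUIRED):
    """Return the entries of `required` that are not regular files under tessdata_dir."""
    root = Path(tessdata_dir)
    return [name for name in required if not (root / name).is_file()]
```

For example, calling missing_tessdata_files("ready-to-use/amazonlinux-1/tesseract/share/tessdata") from the root of the layer repository returns an empty list when every assumed file is present, and otherwise the names that still need to be copied in.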
I hope this helps future readers. Happy coding.
Thanks to Benjamin Genz (@bweigel) for publishing this repo. You made our lives easier.