I'm trying to add new fonts to Tesseract OCR. I'm following this tutorial, but I'm running into a problem.
Here's what I've done so far:
Create training document
convert eng.myfont.exp0.pdf eng.myfont.exp0.tif
Train Tesseract
tesseract eng.myfont.exp0.tif eng.myfont.exp0 batch.nochop makebox
This created my eng.myfont.exp0.box file.
I opened the file with moshpytt and made sure the characters were detected correctly.
Feed the box file back into tesseract
tesseract eng.myfont.exp0.tif eng.myfont.exp0.box nobatch box.train.stderr
I have this result:
Tesseract Open Source OCR Engine v3.03 with Leptonica
APPLY_BOXES:
Boxes read from boxfile: 146
Found 146 good blobs.
TRAINING ... Font name = myfont.exp0
Generated training data for 6 words
Try to detect the character set used in the box file (this is where I get stuck)
unicharset_extractor *.box
Result:
unicharset_extractor: command not found
I also tried unicharset_extractor eng.myfont.exp0.box
with the same result.
I'm using:
Luckily, you can train your Tesseract so it can read your font easily.
OCR training works best if training images contain blocks of many words. You can use the insertText function to automatically generate training images for a known font. Remove any noisy images.
jTessBoxEditor is a box editor and trainer for Tesseract OCR. It provides box data editing for both Tesseract 2.0x and 3.0x formats, and full automation of Tesseract training. It can read images of common image formats, including multi-page TIFF. The program requires Java Runtime Environment 7 or later.
tesstrain.sh needs certain files during the training process. These are normally stored in a 'langdata' directory. The langdata for the languages that Tesseract officially supports is stored in the langdata repository, but you can of course store langdata wherever you want.
The training tools for Tesseract 3.03 RC were omitted from the Ubuntu 14.04 packages, which is why unicharset_extractor is reported as "command not found". Either fall back to Tesseract 3.02 or upgrade to Ubuntu 14.10, which should include them.
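To confirm whether the training tools are actually missing from your installation, a quick check along these lines may help. This is only a sketch: missing_tools is a hypothetical helper name, and the list of tools is my assumption about what a Tesseract 3.x training run typically needs.

```shell
#!/bin/sh
# Report which Tesseract 3.x training tools are not reachable on PATH.
# missing_tools is a hypothetical helper, not part of Tesseract.
missing_tools() {
    missing=""
    for tool in "$@"; do
        # command -v succeeds only if the shell can resolve the name.
        if ! command -v "$tool" >/dev/null 2>&1; then
            missing="$missing $tool"
        fi
    done
    # Print the missing names separated by spaces, without the leading space.
    echo "${missing# }"
}

# Assumed tool list for a 3.x training run; adjust to your tutorial.
result=$(missing_tools unicharset_extractor mftraining cntraining combine_tessdata)
if [ -n "$result" ]; then
    echo "Missing training tools: $result"
else
    echo "All training tools found."
fi
```

If unicharset_extractor shows up in the list, the problem is the installation rather than the command you typed.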
Ok, I googled this for you. Here's the answer:
You need to run all the commands from the folder where your input files are located.
From:
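One way to follow the suggestion above without changing your current directory is to wrap each command in a small helper. This is a sketch only: run_in_dir is a hypothetical name, and ./training in the usage comment is an assumed location for the .tif and .box files.

```shell
#!/bin/sh
# Run a command from inside the directory that holds the training files.
# run_in_dir is a hypothetical helper, not part of Tesseract.
run_in_dir() {
    dir=$1
    shift
    # The subshell keeps the caller's working directory unchanged.
    ( cd "$dir" && "$@" )
}

# Usage, assuming the input files live in ./training:
#   run_in_dir ./training unicharset_extractor eng.myfont.exp0.box
# Note that a glob such as *.box would be expanded in the caller's
# directory before the helper runs, so pass explicit file names.
```

Keep in mind this only helps when the tool itself is installed. A "command not found" error means the shell could not locate the program, regardless of the folder you run it from.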