Adding New Fonts to Tesseract 3

Tags: ocr, tesseract

I'm trying to add new fonts to tesseract ocr. I'm following this tutorial but I'm having some problems.

Here's what I've done so far:

  1. Create training document

    convert eng.myfont.exp0.pdf eng.myfont.exp0.tif

  2. Train Tesseract

    tesseract eng.myfont.exp0.tif eng.myfont.exp0 batch.nochop makebox

    This created my eng.myfont.exp0.box file.

    I opened the file with moshpytt and made sure the characters were detected correctly.

  3. Feed the box file back into tesseract

    tesseract eng.myfont.exp0.tif eng.myfont.exp0.box nobatch box.train.stderr

    I have this result:

    Tesseract Open Source OCR Engine v3.03 with Leptonica
    APPLY_BOXES:
    Boxes read from boxfile: 146
    Found 146 good blobs.
    TRAINING ... Font name = myfont.exp0
    Generated training data for 6 words

    The files eng.myfont.exp0.box.tr and eng.myfont.exp0.box.txt were generated.

  4. Try to detect the character set used in the box file (this is where I get stuck)

    unicharset_extractor *.box

Result:

unicharset_extractor: command not found

I also tried unicharset_extractor eng.myfont.exp0.box, with the same result.

I'm using:

  • tesseract 3.03
  • leptonica-1.70
  • libgif 4.1.6(?) : libjpeg 8d : libpng 1.2.50 : libtiff 4.0.3 : zlib 1.2.8 : webp 0.4.0
  • Ubuntu 14.04.1 LTS
Jose Ismael Reyes asked Oct 05 '14 17:10





2 Answers

The training tools for Tesseract 3.03 RC were omitted from Ubuntu 14.04. So either fall back to Tesseract 3.02 or upgrade to Ubuntu 14.10, which should include them.
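
To confirm whether the training tools are actually missing on a given machine, one option is to look for each binary on the PATH. This is only a minimal sketch, not part of the original answer: the helper name `missing_tools` is made up for illustration, and the tool list assumes the usual Tesseract 3 training binaries.

```shell
#!/bin/sh
# Hypothetical helper: print the name of every given command
# that cannot be found on the PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Check the binaries used by the Tesseract 3 training steps.
# Any name printed here is not installed (or not on the PATH).
missing_tools unicharset_extractor mftraining cntraining combine_tessdata
```

If the script prints `unicharset_extractor`, the "command not found" error comes from the installation rather than from the box file or the command arguments.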

nguyenq answered Jan 01 '23 13:01



Ok, I googled this for you. Here's the answer:

You need to run all commands in the same folder where your input files are located.

From:

  • https://code.google.com/p/tesseract-ocr/issues/detail?id=945 and
  • https://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3#Background_and_Limitations
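
Running a command from the folder that holds the input files could be sketched as follows. The helper name `run_in_dir` and the path in the usage comment are hypothetical, used only for illustration.

```shell
#!/bin/sh
# Hypothetical helper: run a command from inside the given directory.
# The subshell keeps the caller's working directory unchanged.
run_in_dir() {
    dir=$1
    shift
    ( cd "$dir" && "$@" )
}

# Usage (the path is a placeholder). The glob is wrapped in sh -c so
# that *.box is expanded inside the training folder, not by the caller:
#   run_in_dir /path/to/training sh -c 'unicharset_extractor *.box'
```

This only helps when the tools are installed; it does not make a missing `unicharset_extractor` binary available.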
mlissner answered Jan 01 '23 15:01
