Adding New Fonts to Tesseract 3

Tags: ocr, tesseract

I'm trying to add new fonts to Tesseract OCR. I'm following this tutorial, but I'm having some problems.

Here's what I've done so far:

1. Create the training document:

   convert eng.myfont.exp0.pdf eng.myfont.exp0.tif

2. Train Tesseract:

   tesseract eng.myfont.exp0.tif eng.myfont.exp0 batch.nochop makebox

   This created my eng.myfont.exp0.box file. I opened the file with moshpytt and made sure the characters were detected correctly.

3. Feed the box file back into Tesseract:

   tesseract eng.myfont.exp0.tif eng.myfont.exp0.box nobatch box.train.stderr

   I got this result:

   Tesseract Open Source OCR Engine v3.03 with Leptonica
   APPLY_BOXES:
   Boxes read from boxfile: 146
   Found 146 good blobs.
   TRAINING ... Font name = myfont.exp0
   Generated training data for 6 words

   - The files eng.myfont.exp0.box.tr and eng.myfont.exp0.box.txt were generated.

4. Try to detect the character set used in the box file (this is where I get stuck):

   unicharset_extractor *.box

Result:

unicharset_extractor: command not found

I also tried unicharset_extractor eng.myfont.exp0.box, with the same result.

I'm using:

tesseract 3.03
leptonica-1.70
libgif 4.1.6(?) : libjpeg 8d : libpng 1.2.50 : libtiff 4.0.3 : zlib 1.2.8 : webp 0.4.0
Ubuntu 14.04.1 LTS

asked Oct 05 '14 17:10

Jose Ismael Reyes

2 Answers

The training tools for Tesseract 3.03 RC were omitted from Ubuntu 14.04, which is why unicharset_extractor is not found. Either fall back to Tesseract 3.02 or upgrade to Ubuntu 14.10, which should include them.
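
One way to confirm that the tools are missing from the installation, rather than mistyped, is to ask the shell where each binary lives. The following is a minimal sketch, not part of the original answer: check_training_tools is a hypothetical helper name, and the list of tool names passed to it is an assumption based on the usual Tesseract 3 training workflow.

```shell
#!/bin/sh
# Report which of the given commands are reachable through PATH.
# Succeeds only if every command was found.
check_training_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found: $tool"
        else
            echo "missing: $tool"
            missing=$((missing + 1))
        fi
    done
    [ "$missing" -eq 0 ]
}

# Assumed tool list for Tesseract 3 training; adjust to the steps you need.
check_training_tools unicharset_extractor mftraining cntraining combine_tessdata \
    || echo "Some training tools are not installed or not on PATH."
```

If every tool is reported as missing while tesseract itself runs, the installed package most likely does not ship the training binaries at all.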

answered Jan 01 '23 13:01

nguyenq

Ok, I googled this for you. Here's the answer:

You need to run all commands in the same folder where are located your input files.

From:

https://code.google.com/p/tesseract-ocr/issues/detail?id=945 and
https://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3#Background_and_Limitations
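
As a rough illustration of that advice, the sketch below runs a command from inside the folder that holds the input files. run_in_dir is a hypothetical helper, and TRAINING_DIR is an assumed variable standing in for wherever the .tif and .box files are stored; neither comes from the linked pages.

```shell
#!/bin/sh
# Run a command from inside the given directory. The subshell keeps the
# caller's working directory unchanged.
run_in_dir() {
    dir=$1
    shift
    ( cd "$dir" && "$@" )
}

# Hypothetical usage. The glob is wrapped in sh -c so that *.box is
# expanded inside TRAINING_DIR rather than in the caller's directory.
# run_in_dir "$TRAINING_DIR" sh -c 'unicharset_extractor *.box'
```

Note that this only helps when the tool is installed; a "command not found" error is reported by the shell before any input file is read.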

answered Jan 01 '23 15:01

mlissner
