Strength of Dictionary in Tesseract 3

Tags: ocr, tesseract

How do I increase or decrease the strength of the dictionary in Tesseract 3?

The FAQ says I need to change the values of "NON_WERD" and "GARBAGE_STRING", but they do not exist in Tesseract 3.

William Lopes asked Jan 20 '12 11:01



1 Answer

According to the FAQ (http://code.google.com/p/tesseract-ocr/wiki/FAQ), these are the variables to change:

enable_new_segsearch 1
language_model_penalty_non_freq_dict_word 0.2
language_model_penalty_non_dict_word 0.3

Increase the two penalty values to bias Tesseract more strongly toward dictionary words; decrease them to weaken the dictionary's influence.

Note: You must set enable_new_segsearch to 1, otherwise the two penalty values will have no effect.
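One way to apply these settings is to put them in a Tesseract config file (one "name value" pair per line) and pass that file as the last argument on the command line. The Python sketch below only builds the config text and the command; the file names page.png, out and dict_bias.cfg are hypothetical, and the numbers are the ones quoted above, not tuned recommendations. It assumes the tesseract binary is on your PATH and that your build accepts a config file after the output base name.

```python
from pathlib import Path


def render_config(params):
    """Render Tesseract config-file text: one 'name value' pair per line."""
    return "".join(f"{name} {value}\n" for name, value in params.items())


def build_command(image, output_base, config_path, lang="eng"):
    """Build the argument list: tesseract IMAGE OUTPUT_BASE -l LANG CONFIG."""
    return ["tesseract", str(image), str(output_base), "-l", lang, str(config_path)]


def write_config(path, params):
    """Write the rendered config to disk and return the path that was written."""
    path = Path(path)
    path.write_text(render_config(params))
    return path


# Values taken from the answer above; raise the two penalties for a stronger
# dictionary bias, lower them for a weaker one.
params = {
    "enable_new_segsearch": 1,
    "language_model_penalty_non_freq_dict_word": 0.2,
    "language_model_penalty_non_dict_word": 0.3,
}

config_text = render_config(params)

# Hypothetical file names, used only for illustration.
command = build_command("page.png", "out", "dict_bias.cfg")

print(config_text, end="")
print(" ".join(command))
```

To actually run it, call write_config("dict_bias.cfg", params) and then pass command to subprocess.run(command, check=True). Comparing the output for a few penalty values on the same image is probably the quickest way to find a setting that suits your documents.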

roocell answered Oct 07 '22 14:10