How do I increase/decrease the strength of the dictionary in tesseract 3 ?
In the FAQ it says I need to change the value of "NON_WERD" and "GARBAGE_STRING" but they do not exist in Tesseract 3.
Tesseract tests the text lines to determine whether they are fixed pitch. Where it finds fixed pitch text, Tesseract chops the words into characters using the pitch, and disables the chopper and associator on these words for the word recognition step.
The tesseract OCR engine uses language-specific training data in the recognize words. The OCR algorithms bias towards words and sentences that frequently appear together in a given language, just like the human brain does.
Tesseract is an optical character recognition engine for various operating systems. It is free software, released under the Apache License. Originally developed by Hewlett-Packard as proprietary software in the 1980s, it was released as open source in 2005 and development has been sponsored by Google since 2006.
Python-tesseract is an optical character recognition (OCR) tool for python. That is, it will recognize and “read” the text embedded in images. Python-tesseract is a wrapper for Google's Tesseract-OCR Engine.
According to http://code.google.com/p/tesseract-ocr/wiki/FAQ, you change these variables:
enable_new_segsearch 1
language_model_penalty_non_freq_dict_word 0.2
language_model_penalty_non_dict_word 0.3
Increase their values to make Tesseract more biased to dictionary words.
Note: You must set enable_new_segsearch
, otherwise they'll have no effect.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With