How can I disable dictionary corrections when running Tesseract for English language?
I'm currently running tesseract as a child process.
Inevitably, noise in an input image, non-standard fonts that Tesseract wasn't trained on, or less than ideal image quality will cause Tesseract to make a mistake and incorrectly OCR a piece of text.
Engine Mode ( --oem ). Tesseract has several engine modes with different performance and speed. Tesseract 4 have introduced additional LSTM neural net mode, which often works best.
Try to set these variables (put them in a config file) to false:
load_system_dawg
load_freq_dawg
load_punc_dawg
load_number_dawg
load_unambig_dawg
load_bigram_dawg
load_fixed_length_dawgs
https://groups.google.com/forum/?fromgroups=#!searchin/tesseract-ocr/Disable$20dictionary$20in$20Tesseract/tesseract-ocr/5nvIo1DJxHE/f3gBi2pTKykJ
Also read How to increase the trust in/strength of the dictionary? in the FAQ. From it:
For tesseract-ocr < 3.01 try upping NON_WERD and GARBAGE_STRING in dict/permute.cpp to maybe 3 or even 5.
For tesseract-ocr >= 3.01 try increasing the variables language_model_penalty_non_freq_dict_word and language_model_penalty_non_dict_word in a config file. By default they are 0.1 and 0.15 respectively.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With