
"language_model_penalty_non_dict_word" has no effect in tesseract 3.01

I'm setting language_model_penalty_non_dict_word through a config file for Tesseract 3.01, but its value has no effect. I've tried multiple images and multiple values, and the output for each image is always the same. Another user reported the same behavior in a comment on another question.

Edit: After looking at the source, I found that the variable language_model_penalty_non_dict_word is used only inside the function float LanguageModel::ComputeAdjustedPathCost.

However, this function is never called! It is referenced by only two functions, LanguageModel::UpdateBestChoice() and LanguageModel::AddViterbiStateEntry(). I placed breakpoints in both, and neither was being called either.

asked Apr 23 '15 at 14:04 by sashoalm


1 Answer

After some debugging, I finally found the reason: the function Wordrec::SegSearch() wasn't being called, and it sits above LanguageModel::ComputeAdjustedPathCost() in the call graph.

The call to SegSearch() is guarded by enable_new_segsearch:

  if (enable_new_segsearch) {
    SegSearch(&chunks_record, word->best_choice,
              best_char_choices, word->raw_choice, state);
  } else {
    best_first_search(&chunks_record, best_char_choices, word,
                      state, fixpt, best_state);
  }

So you need to set enable_new_segsearch in the config file:

enable_new_segsearch    1

language_model_penalty_non_freq_dict_word 0.2
language_model_penalty_non_dict_word 0.3
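
To catch this mistake before running OCR, a config file can be checked for language_model_* settings that are not accompanied by enable_new_segsearch. The following is only a rough sketch: ParseConfig and HasIgnoredLanguageModelParams are hypothetical helpers, not part of Tesseract. It assumes the config format is one whitespace-separated "name value" pair per line, that enable_new_segsearch is off by default (as the behavior in 3.01 suggests), and that a boolean value counts as true when it starts with 1, T, t, Y or y.

```cpp
#include <map>
#include <sstream>
#include <string>

// Parse Tesseract-style config text: one "name value" pair per line,
// separated by whitespace. Lines without both fields are skipped.
std::map<std::string, std::string> ParseConfig(const std::string& text) {
  std::map<std::string, std::string> params;
  std::istringstream lines(text);
  std::string line;
  while (std::getline(lines, line)) {
    std::istringstream fields(line);
    std::string name, value;
    if (fields >> name >> value) params[name] = value;
  }
  return params;
}

// Hypothetical check (not a Tesseract API): returns true if the config sets
// any language_model_* parameter without enabling enable_new_segsearch.
// Assumption: the flag defaults to off, so leaving it out counts as disabled.
bool HasIgnoredLanguageModelParams(const std::string& text) {
  const std::map<std::string, std::string> params = ParseConfig(text);
  const auto flag = params.find("enable_new_segsearch");
  bool seg_search_on = false;
  if (flag != params.end()) {
    // Assumption: boolean values starting with 1/T/t/Y/y are treated as true.
    const char c = flag->second[0];
    seg_search_on = (c == '1' || c == 'T' || c == 't' || c == 'Y' || c == 'y');
  }
  if (seg_search_on) return false;
  for (const auto& entry : params) {
    // rfind(prefix, 0) == 0 is a "starts with" test.
    if (entry.first.rfind("language_model_", 0) == 0) return true;
  }
  return false;
}
```

Passing the three-line config above to HasIgnoredLanguageModelParams returns false, while the same config without the enable_new_segsearch line returns true, which is the situation described in the question.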
answered Nov 17 '22 at 11:11 by sashoalm