Tesseract user-patterns

Tags:

tesseract

Does anyone know how to use user patterns (user_patterns_suffix) in Tesseract? Could you advise me on how to set them up and how to test that they are working? I tried to follow the Tesseract guide on user patterns, but it did not seem to affect the result at all.

Thanks.

kha nguyen asked Jun 20 '13 09:06



1 Answer

Tesseract uses a pattern as a sort of "regular expression". Patterns are useful if, say, you are scanning a book whose data is all in the same format. A pattern tells Tesseract what formats to expect, much like user-words tells it which words to expect. Below is how Tesseract describes the use of patterns:

Each pattern can contain any non-whitespace characters, however only the patterns that contain characters from the unicharset of the corresponding language will be useful.

The only meta character is \. To use it as an ordinary character in a pattern, escape it with another \ (e.g. the string C:\Documents should be written in the patterns file as C:\\Documents).

This function supports a very limited regular expression syntax. One can express a character, a certain character class and a number of times the entity should be repeated in the pattern.

To denote a character class use one of:

  • \c - unichar for which UNICHARSET::get_isalpha() is true (character)
  • \d - unichar for which UNICHARSET::get_isdigit() is true
  • \n - unichar for which UNICHARSET::get_isdigit() or UNICHARSET::get_isalpha() is true (alphanumeric)
  • \p - unichar for which UNICHARSET::get_ispunct() is true
  • \a - unichar for which UNICHARSET::get_islower() is true
  • \A - unichar for which UNICHARSET::get_isupper() is true

\* could be specified after each character or pattern to indicate that the character/pattern can be repeated any number of times before the next character/pattern occurs.

Examples:

1-8\d\d-GOOG-411 will be expanded to strings: 1-800-GOOG-411, 1-801-GOOG-411, ... 1-899-GOOG-411.

"ww.\n\*.com" will be expanded to strings like: "ww.a.com" "ww.a123.com" ... "ww.ABCDefgHIJKLMNop.com"
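To sanity-check a pattern before handing it to Tesseract, the syntax above can be approximated with an ordinary Python regex. This is only a rough sketch: the helper name pattern_to_regex and the ASCII character classes are my own assumptions, whereas Tesseract resolves the classes against the language's unicharset, so the real matching may differ.

```python
import re
import string

# ASCII stand-ins for the Tesseract character classes. Assumption: the real
# classes come from the language's unicharset, not from ASCII ranges.
CLASS_MAP = {
    "c": "[A-Za-z]",
    "d": "[0-9]",
    "n": "[A-Za-z0-9]",
    "p": "[" + re.escape(string.punctuation) + "]",
    "a": "[a-z]",
    "A": "[A-Z]",
}


def pattern_to_regex(pattern):
    """Translate one user-pattern line into an approximate Python regex."""
    parts = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch != "\\":
            # Ordinary character: match it literally.
            parts.append(re.escape(ch))
            i += 1
            continue
        if i + 1 >= len(pattern):
            raise ValueError("dangling backslash at end of pattern")
        nxt = pattern[i + 1]
        if nxt == "*":
            # \* repeats the preceding character or class any number of times.
            if not parts:
                raise ValueError("\\* has nothing to repeat")
            parts[-1] += "*"
        elif nxt == "\\":
            # \\ is a literal backslash.
            parts.append(re.escape("\\"))
        elif nxt in CLASS_MAP:
            parts.append(CLASS_MAP[nxt])
        else:
            raise ValueError("unknown escape: \\" + nxt)
        i += 2
    return re.compile("".join(parts))


phone = pattern_to_regex(r"1-8\d\d-GOOG-411")
print(bool(phone.fullmatch("1-800-GOOG-411")))  # True
print(bool(phone.fullmatch("1-900-GOOG-411")))  # False
```

This only tells you which strings a pattern is meant to cover; it says nothing about whether Tesseract will actually prefer them during recognition.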

Note: In choosing which patterns to include, please be aware that providing very generic patterns will make Tesseract run slower. For example, \n\* at the beginning of the pattern will make Tesseract consider all the combinations of proposed character choices for each of the segmentations, which will be unacceptably slow. Because of potential problems with speed that could be difficult to identify, each user pattern has to have at least kSaneNumConcreteChars concrete characters from the unicharset at the beginning.
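As for actually trying it out, here is a minimal sketch of writing a patterns file and building the command line. It assumes a Tesseract build whose CLI accepts the --user-patterns option (the 4.x help text lists it). The file names and the helper functions are hypothetical. Older builds may instead need a config file that sets user_patterns_suffix, as in the question.

```python
import subprocess
from pathlib import Path


def write_patterns(path, patterns):
    """Write one pattern per line, as the patterns file format expects."""
    Path(path).write_text("\n".join(patterns) + "\n", encoding="utf-8")


def build_command(image, output_base, patterns_file=None, oem=None):
    """Build the tesseract argument list (hypothetical helper).

    Assumption: the installed CLI supports --user-patterns and --oem.
    """
    cmd = ["tesseract", str(image), str(output_base)]
    if oem is not None:
        cmd += ["--oem", str(oem)]
    if patterns_file is not None:
        cmd += ["--user-patterns", str(patterns_file)]
    return cmd


def run_ocr(image, output_base, patterns_file=None, oem=None):
    """Run tesseract and return the recognized text from <output_base>.txt."""
    subprocess.run(build_command(image, output_base, patterns_file, oem), check=True)
    return Path(str(output_base) + ".txt").read_text(encoding="utf-8")
```

The simplest test is to run the same image twice, once with patterns_file=None and once with the file, and diff the two outputs. Whether the patterns change anything may also depend on the engine mode, so it is worth repeating the comparison with different oem values if your traineddata supports them.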

stuartthomas25 answered Jan 18 '23 09:01