Tesseract user-patterns

Tags:

tesseract

Does anyone know how to use user patterns (user_patterns_suffix) in Tesseract? Could you advise me on how to set them up and how to test that they are working? I tried to follow the Tesseract guide (Tesseract user-patterns), but I didn't see it affect the result at all.

Thanks.


asked Jun 20 '13 09:06

kha nguyen

1 Answer

Each pattern can contain any non-whitespace characters, however only the patterns that contain characters from the unicharset of the corresponding language will be useful.

The only meta character is \. To use a literal backslash in a pattern, escape it with another \ (e.g. the string C:\Documents should be written in the patterns file as C:\\Documents).
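
If you generate the patterns file from a script, the escaping rule can be sketched like this. The helper name escape_literal is my own (hypothetical, not part of Tesseract); it only doubles backslashes and rejects whitespace, as described above.

```python
def escape_literal(text: str) -> str:
    """Escape a literal string for one line of a user-patterns file.

    Hypothetical helper: backslash is the only meta character, so doubling
    it is the only escaping applied here.
    """
    if any(ch.isspace() for ch in text):
        raise ValueError("a pattern cannot contain whitespace")
    return text.replace("\\", "\\\\")


print(escape_literal(r"C:\Documents"))  # prints C:\\Documents
```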

This function supports a very limited regular expression syntax. One can express a character, a certain character class and a number of times the entity should be repeated in the pattern.

To denote a character class use one of:

\c - unichar for which UNICHARSET::get_isalpha() is true (character)

\d - unichar for which UNICHARSET::get_isdigit() is true

\n - unichar for which UNICHARSET::get_isdigit() or UNICHARSET::get_isalpha() is true

\p - unichar for which UNICHARSET::get_ispunct() is true

\a - unichar for which UNICHARSET::get_islower() is true

\A - unichar for which UNICHARSET::get_isupper() is true

\* could be specified after each character or pattern to indicate that the character/pattern can be repeated any number of times before the next character/pattern occurs.

Examples:

1-8\d\d-GOOG-411 will be expanded to the strings: 1-800-GOOG-411, 1-801-GOOG-411, ... 1-899-GOOG-411.

ww.\n\*.com will be expanded to strings like: ww.a.com, ww.a123.com, ... ww.ABCDefgHIJKLMNop.com

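To get a feel for what a pattern covers before handing it to Tesseract, one option is to translate it into an ordinary Python regex and try a few strings. This is only a rough sketch under my own assumptions: the character classes are approximated with ASCII ranges instead of the language's unicharset, \* is read as "one or more" because that is what the examples show, and pattern_to_regex is a hypothetical helper, not a Tesseract function.

```python
import re
import string

# ASCII-only stand-ins for the UNICHARSET predicates (an assumption made for
# illustration; the real classes depend on the language's unicharset).
CLASS_TO_REGEX = {
    "c": "[A-Za-z]",                                 # \c: alphabetic
    "d": "[0-9]",                                    # \d: digit
    "n": "[A-Za-z0-9]",                              # \n: digit or alphabetic
    "p": "[" + re.escape(string.punctuation) + "]",  # \p: punctuation
    "a": "[a-z]",                                    # \a: lowercase
    "A": "[A-Z]",                                    # \A: uppercase
}


def pattern_to_regex(pattern: str) -> str:
    """Translate one user-pattern line into an approximate Python regex."""
    parts = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch != "\\":
            parts.append(re.escape(ch))  # ordinary literal character
            i += 1
            continue
        if i + 1 >= len(pattern):
            raise ValueError("dangling backslash at end of pattern")
        nxt = pattern[i + 1]
        if nxt == "\\":
            parts.append(re.escape("\\"))  # escaped literal backslash
        elif nxt == "*":
            if not parts:
                raise ValueError(r"\* must follow a character or class")
            # Assumption: repetition means "one or more" of the previous item.
            parts[-1] = "(?:" + parts[-1] + ")+"
        elif nxt in CLASS_TO_REGEX:
            parts.append(CLASS_TO_REGEX[nxt])
        else:
            raise ValueError("unknown escape: \\" + nxt)
        i += 2
    return "".join(parts)


phone = re.compile(pattern_to_regex(r"1-8\d\d-GOOG-411"))
print(bool(phone.fullmatch("1-801-GOOG-411")))  # True
print(bool(phone.fullmatch("1-901-GOOG-411")))  # False
```

If zero repetitions should also match, change the "+" to "*" in the repetition branch.
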
Note: When choosing which patterns to include, be aware that very generic patterns will make Tesseract run slower. For example, \n\* at the beginning of a pattern will make Tesseract consider all the combinations of proposed character choices for each of the segmentations, which will be unacceptably slow. Because of potential speed problems that could be difficult to identify, each user pattern has to have at least kSaneNumConcreteChars concrete characters from the unicharset at the beginning.
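
A quick pre-check for that last rule could look like the sketch below. It counts the literal characters before the first character class or \*, and compares the count against a threshold you pass in. The quoted text does not give the value of kSaneNumConcreteChars, so min_concrete is a placeholder parameter, and both function names are hypothetical. This is not Tesseract's own check, just a way to spot patterns that start too generically.

```python
def leading_concrete_chars(pattern: str) -> int:
    """Count literal characters before the first class or repetition escape."""
    count = 0
    i = 0
    while i < len(pattern):
        if pattern[i] != "\\":
            count += 1  # ordinary literal character
            i += 1
        elif pattern[i + 1 : i + 2] == "\\":
            count += 1  # an escaped backslash is one literal character
            i += 2
        else:
            break  # a class such as \d, or a \* repetition
    return count


def starts_concretely(pattern: str, min_concrete: int) -> bool:
    """True if the pattern begins with at least min_concrete literals.

    min_concrete is a placeholder for kSaneNumConcreteChars, whose actual
    value is not stated in the text above.
    """
    return leading_concrete_chars(pattern) >= min_concrete


print(leading_concrete_chars(r"1-8\d\d-GOOG-411"))  # 3
print(leading_concrete_chars(r"\n\*.com"))          # 0
```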

answered Jan 18 '23 09:01

stuartthomas25
