
Tesseract OCR: detect only user-words

I've been at this for a bit but I can't seem to restrict Tesseract to only output words from the "user-words" dictionary I built. I don't want anything else, just basic matching against those words.

Does anyone know how to do this?

user2467731 asked Mar 16 '14 02:03



1 Answer

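Try Tesseract's bazaar configuration. It is one of the config files shipped in tessdata/configs; it turns off the built-in dictionaries and loads your user-words list instead.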
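As a rough sketch of how that could look from Python: the config name is passed after the output base on the tesseract command line, and the bazaar config typically expects the word list at tessdata/<lang>.user-words (for example eng.user-words). The dictionary only biases recognition, so words outside the list may still appear; the second helper filters the result against the list afterwards. The function names, file names, and the filtering step are assumptions for illustration, not part of Tesseract.

```python
import shutil
import subprocess

PUNCTUATION = ".,;:!?\"'()[]"


def build_tesseract_command(image_path, output_base, lang="eng", config="bazaar"):
    """Build the argv list for a Tesseract run that uses a named config file."""
    # Config file names go after the output base on the command line.
    return ["tesseract", image_path, output_base, "-l", lang, config]


def run_ocr(image_path, output_base, lang="eng"):
    """Run Tesseract with the bazaar config and return the recognized text."""
    cmd = build_tesseract_command(image_path, output_base, lang=lang)
    if shutil.which(cmd[0]) is None:
        raise RuntimeError("tesseract executable not found on PATH")
    subprocess.run(cmd, check=True)
    # Tesseract writes plain-text output to <output_base>.txt.
    with open(output_base + ".txt", encoding="utf-8") as handle:
        return handle.read()


def keep_known_words(text, user_words):
    """Keep only tokens that match the user word list (case-insensitive)."""
    allowed = {word.lower() for word in user_words}
    kept = []
    for token in text.split():
        word = token.strip(PUNCTUATION)
        if word.lower() in allowed:
            kept.append(word)
    return kept


if __name__ == "__main__":
    # Hypothetical file names; replace with your own image and word list.
    with open("eng.user-words", encoding="utf-8") as handle:
        words = [line.strip() for line in handle if line.strip()]
    print(keep_known_words(run_ocr("scan.png", "scan_out"), words))
```

Filtering after the fact is a blunt fallback: it guarantees the output contains only listed words, at the cost of dropping near misses that the OCR step got slightly wrong.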

nguyenq answered Sep 20 '22 00:09