
Get the exact position of text from an image in Tesseract

Using the GetHOCRText(0) method in Tesseract, I'm able to retrieve the text as HTML, and when I present that HTML in a web view I can see the text, but the position of the text differs from its position in the image. Any ideas would be very helpful.

tesseract->SetInputName("word");
tesseract->SetOutputName("xyz");
tesseract->Recognize(NULL);

char *utf8Text = tesseract->GetHOCRText(0);

This is the image I'm using for Tesseract

and the output image

srividya asked Sep 05 '12 10:09


2 Answers

If you have the hOCR output, you should have a tag for each word. Each of these tags has class="ocrx_word" and title="bbox x1 y1 x2 y2", where (x1, y1) is the top-left corner and (x2, y2) the bottom-right corner of the bounding box around the word, in image pixels. I don't think this information can be used automatically to format a text document, since that would require translating pixel offsets into numbers of tabs and spaces. But you should be able to render each word at the given location.
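As a rough illustration of that last point, here is a minimal sketch that pulls the word boxes out of an hOCR string and emits one absolutely positioned span per word. HocrWord, parseHocrWords and toPositionedSpan are hypothetical helpers, not part of the Tesseract API. The regular expression assumes the class attribute comes before title and that each word span contains plain text only; some Tesseract versions nest further tags inside the word span, in which case a real HTML parser would be the safer choice.

```cpp
#include <regex>
#include <string>
#include <vector>

// One recognized word and its bounding box in image pixels
// (origin at the top-left corner of the image).
struct HocrWord {
    int x1, y1, x2, y2;
    std::string text;
};

// Hypothetical helper: collects every ocrx_word span from an hOCR string.
// Assumes class precedes title and the span holds plain text (no nested tags).
std::vector<HocrWord> parseHocrWords(const std::string& hocr) {
    static const std::regex wordRe(
        R"re(<span[^>]*class=['"]ocrx_word['"][^>]*title=['"]bbox (\d+) (\d+) (\d+) (\d+)[^'"]*['"][^>]*>([^<]*)</span>)re");
    std::vector<HocrWord> words;
    for (std::sregex_iterator it(hocr.begin(), hocr.end(), wordRe), end;
         it != end; ++it) {
        const std::smatch& m = *it;
        words.push_back({std::stoi(m[1].str()), std::stoi(m[2].str()),
                         std::stoi(m[3].str()), std::stoi(m[4].str()),
                         m[5].str()});
    }
    return words;
}

// Hypothetical helper: renders a word at its top-left corner.
// The text is copied verbatim from the hOCR, so it is already HTML.
std::string toPositionedSpan(const HocrWord& w) {
    return "<span style=\"position:absolute;left:" + std::to_string(w.x1) +
           "px;top:" + std::to_string(w.y1) + "px\">" + w.text + "</span>";
}
```

The offsets are in pixels of the image that was handed to Tesseract, so if the web view displays the image at a different scale, the left and top values would need the same scale factor applied.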

Mongoose1021 answered Sep 20 '22 15:09


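The GetBoxText() method returns the position of each character. The result is a single string in box-file format, one line per character: the character followed by its left, bottom, right and top pixel coordinates and the page number. These coordinates are measured from the bottom-left corner of the image, not the top-left.

char *boxtext = _tesseract->GetBoxText(0);  // 0 is the page number
NSString* aBoxText = [NSString stringWithUTF8String:boxtext];
delete [] boxtext;  // the caller owns the returned buffer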
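
Because the box coordinates use a bottom-left origin while most drawing and HTML layout uses a top-left origin, the y values usually need flipping before the characters can be placed over the image. A minimal sketch of that conversion follows. CharBox and parseBoxText are hypothetical names, not Tesseract API, and imageHeight is assumed to be the pixel height of the image that was recognized.

```cpp
#include <sstream>
#include <string>
#include <vector>

// One character and its box, converted to a top-left origin.
struct CharBox {
    std::string symbol;  // UTF-8 bytes of the character
    int left, top, right, bottom;
};

// Hypothetical helper: parses box-file text, one line per character in the
// form "<symbol> <left> <bottom> <right> <top> <page>", and flips the y axis.
std::vector<CharBox> parseBoxText(const std::string& boxText, int imageHeight) {
    std::vector<CharBox> boxes;
    std::istringstream lines(boxText);
    std::string line;
    while (std::getline(lines, line)) {
        std::istringstream fields(line);
        std::string symbol;
        int left, bottom, right, top, page;
        // Skip blank or malformed lines instead of failing.
        if (!(fields >> symbol >> left >> bottom >> right >> top >> page)) {
            continue;
        }
        // In the box file, "top" is the larger y value; after flipping it
        // becomes the smaller one, as expected with a top-left origin.
        boxes.push_back({symbol, left, imageHeight - top,
                         right, imageHeight - bottom});
    }
    return boxes;
}
```

For example, with a 100-pixel-high image, the line "T 10 20 30 50 0" yields a box with left 10, top 50, right 30 and bottom 80.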

Ab'initio answered Sep 22 '22 15:09