Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Not able to understand coordinate in extracted document using OCR engine tesseract

I have extracted a image document from tesseract and It has extracted successful. But I am not able to understand coordinate of extracted document.

Problem description: -

It showing coordinates but let me know that are these coordinates representing pixel or something else. These are in four like title="bbox 10 13 43 46" , so what is 10, 13 43 and 46. What position they are representing

complete code after extracting

   <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<title>
</title>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
<meta name='ocr-system' content='tesseract'/>
</head>
<body>
<div class='ocr_page' id='page_1' title='image "D:\ABC.tif"; bbox 0 0 464 101'>
    <div class='ocr_carea' id='block_1_1' title="bbox 10 13 330 55">
    <p 1class='ocr_par'>
        <span class='ocr_line' id='line_1_1' title="bbox 10 13 330 55">
            <span class='ocr_word' id='word_1_1' title="bbox 10 13 43 46">
                <span class='ocrx_word' id='xword_1_1' title="x_wconf -1"><strong>hi</strong></span>
            </span> 
            <span class='ocr_word' id='word_1_2' title="bbox 148 13 268 47">
                <span class='ocrx_word' id='xword_1_2' title="x_wconf -1"><strong>whats</strong></span>
            </span> 
            <span class='ocr_word' id='word_1_3' title="bbox 283 22 330 55">
                <span class='ocrx_word' id='xword_1_3' title="x_wconf -1"><strong>up</strong></span>
            </span>
        </span>
    </p>
    </div>
</div>
</body>
</html>
like image 504
shiv.mymail Avatar asked Aug 31 '13 16:08

shiv.mymail


People also ask

Is Tesseract good for OCR?

Tesseract does various image processing operations internally (using the Leptonica library) before doing the actual OCR. It generally does a very good job of this, but there will inevitably be cases where it isn't good enough, which can result in a significant reduction in accuracy.

Can Tesseract OCR read PDF?

Tesseract reads only image files, not pdf. You can convert PDF to image (tif, png) and OCR those. Or use wrappers that use tesseract.

Can Tesseract recognize numbers?

Python Tesseract 4.0 OCR: Recognize only Numbers / Digits and exclude all other Characters. Googles Tesseract (originally from HP) is one of the most popular, free Optical Character Recognition (OCR) software out there. It can be used with several programming languages because many wrappers exist for this project.


2 Answers

Well for anybody who still is wondering how the coordinate system is working, i finally found it and this is like

10 13 43 46 startx, starty, endx, endy

if you want to find width and height of the word that would be

width = endx - startx, height = endy - starty

split the string with ' ' and then eliminate bbox and there you go..

like image 151
AbdulMueed Avatar answered Sep 22 '22 14:09

AbdulMueed


Maybe this will help someone in the future. I think the image speaks for itself. You can compute the height or top distance (for css) from those values (eg. height = y1-y0) enter image description here

like image 27
hepifish Avatar answered Sep 22 '22 14:09

hepifish