Unable to understand coordinates in a document extracted with the Tesseract OCR engine

Tags:

text-extraction

ocr

tesseract

hocr

I ran an image document through Tesseract and the extraction succeeded, but I cannot understand the coordinates in the extracted document.

Problem description:

The output shows coordinates, but are they in pixels or some other unit? They come in groups of four, like title="bbox 10 13 43 46", so what do 10, 13, 43 and 46 mean? What positions do they represent?

Complete output after extraction:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<title>
</title>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
<meta name='ocr-system' content='tesseract'/>
</head>
<body>
<div class='ocr_page' id='page_1' title='image "D:\ABC.tif"; bbox 0 0 464 101'>
    <div class='ocr_carea' id='block_1_1' title="bbox 10 13 330 55">
    <p class='ocr_par'>
        <span class='ocr_line' id='line_1_1' title="bbox 10 13 330 55">
            <span class='ocr_word' id='word_1_1' title="bbox 10 13 43 46">
                <span class='ocrx_word' id='xword_1_1' title="x_wconf -1"><strong>hi</strong></span>
            </span> 
            <span class='ocr_word' id='word_1_2' title="bbox 148 13 268 47">
                <span class='ocrx_word' id='xword_1_2' title="x_wconf -1"><strong>whats</strong></span>
            </span> 
            <span class='ocr_word' id='word_1_3' title="bbox 283 22 330 55">
                <span class='ocrx_word' id='xword_1_3' title="x_wconf -1"><strong>up</strong></span>
            </span>
        </span>
    </p>
    </div>
</div>
</body>
</html>


asked Aug 31 '13 16:08

shiv.mymail

2 Answers

For anybody who is still wondering how the coordinate system works, I finally figured it out. The four numbers are, in order:

10 13 43 46 = startx, starty, endx, endy

The values are in pixels, measured from the top-left corner of the page image. If you want the width and height of a word:

width = endx - startx, height = endy - starty

To get the numbers, split the title string on ' ' and drop the bbox token, and there you go.
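The parsing described above could be sketched in Python roughly like this. It is only an illustration: the helper names `parse_bbox` and `bbox_size` are made up here, and a regular expression is used instead of a plain split so that titles carrying extra properties (such as the `image "..."` part on the page element) are handled too.

```python
import re

# Matches the "bbox x0 y0 x1 y1" property inside an hOCR title attribute.
BBOX_RE = re.compile(r"bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)")


def parse_bbox(title):
    """Return (startx, starty, endx, endy) as ints, or None if no bbox is present.

    Hypothetical helper, not part of Tesseract or any hOCR library.
    """
    match = BBOX_RE.search(title)
    if match is None:
        return None
    return tuple(int(value) for value in match.groups())


def bbox_size(bbox):
    """Return (width, height) of a bounding box given as (startx, starty, endx, endy)."""
    startx, starty, endx, endy = bbox
    return endx - startx, endy - starty


# Title strings copied from the hOCR output in the question.
word = parse_bbox("bbox 10 13 43 46")
page = parse_bbox('image "D:\\ABC.tif"; bbox 0 0 464 101')

print(word, bbox_size(word))
print(page, bbox_size(page))
```

Titles without a bbox, such as `x_wconf -1` on the `ocrx_word` spans, make `parse_bbox` return `None`, so callers should check for that before computing a size.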


answered Sep 22 '22 14:09

AbdulMueed

Maybe this will help someone in the future. I think the image speaks for itself: you can compute the height or the top offset (e.g. for CSS) from those values (e.g. height = y1 - y0). [image]
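As a rough sketch of the CSS idea, a bbox could be turned into an absolute-positioning declaration like this. The function name `bbox_to_css` is hypothetical, and the sketch assumes the image is displayed at its native size, so that one image pixel corresponds to one CSS pixel; a scaled image would need the values multiplied by the scale factor.

```python
def bbox_to_css(bbox):
    """Build a CSS declaration string from a bbox given as (x0, y0, x1, y1).

    Hypothetical helper; assumes the page image is shown unscaled.
    """
    x0, y0, x1, y1 = bbox
    return (
        f"position: absolute; left: {x0}px; top: {y0}px; "
        f"width: {x1 - x0}px; height: {y1 - y0}px;"
    )


# bbox of the word "whats" from the hOCR output in the question
print(bbox_to_css((148, 13, 268, 47)))
```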

answered Sep 22 '22 14:09

hepifish
