How do I segment a document using Tesseract and then output the resulting bounding boxes and labels?

Tags: ocr, tesseract, hocr

I'm trying to get Tesseract to output a file with labelled bounding boxes that result from page segmentation (pre-OCR). I know it must be capable of doing this 'out of the box' because of the results shown at the ICDAR competitions, where contestants had to segment and label various documents (academic paper here). Here's an example from that paper illustrating what I want to create: Image of segmented and labelled output (https://i.stack.imgur.com/tmyjJ.png)

I have built the latest version of Tesseract with Homebrew (brew install tesseract --HEAD) and have been trying to edit the config files located in /usr/local/Cellar/tesseract/HEAD/share/tessdata/configs/ to output labelled boxes. The output I get using hocr as the config, i.e.

tesseract infile.tiff outfile_stem -l eng -psm 1 hocr

gives a bounding box for everything and has some labelling in class attributes, e.g.

<p class='ocr_par' dir='ltr' id='par_5_82' title="bbox 2194 4490 3842 4589">
    <span class='ocr_line' id='line_5_142' ...

but I can't visualise this. Is there a standard tool to visualise hOCR files, or is the facility to create an output file with bounding boxes built into Tesseract?
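For readers who want to inspect the hOCR output directly, here is a minimal Python sketch of how the bbox values in the title attributes could be read. It uses only the standard library and is not part of Tesseract. The class and function names (HocrBoxParser, parse_hocr_boxes) are made up for illustration, and the sample's ocr_line bbox is invented because the snippet above is truncated.

```python
import re
from html.parser import HTMLParser

# hOCR stores each box as "bbox x0 y0 x1 y1" inside the title attribute.
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrBoxParser(HTMLParser):
    """Collect (class, id, bbox) for every element whose class starts with 'ocr'."""

    def __init__(self):
        super().__init__()
        self.boxes = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        css_class = attributes.get("class") or ""
        match = BBOX_RE.search(attributes.get("title") or "")
        if css_class.startswith("ocr") and match:
            bbox = tuple(int(value) for value in match.groups())
            self.boxes.append((css_class, attributes.get("id"), bbox))


def parse_hocr_boxes(hocr_text):
    """Return a list of (class, id, (x0, y0, x1, y1)) tuples found in hOCR markup."""
    parser = HocrBoxParser()
    parser.feed(hocr_text)
    parser.close()
    return parser.boxes


# The ocr_par element is the one quoted in the question; the ocr_line bbox is
# a made-up value, since the original snippet is cut off at that point.
sample = (
    "<p class='ocr_par' dir='ltr' id='par_5_82' "
    'title="bbox 2194 4490 3842 4589">'
    "<span class='ocr_line' id='line_5_142' "
    'title="bbox 2194 4490 3842 4530">text</span></p>'
)

for css_class, element_id, bbox in parse_hocr_boxes(sample):
    print(css_class, element_id, bbox)
```

The class name (ocr_par, ocr_line and so on) is the only label hOCR carries here, so this gives the layout hierarchy but not region types such as heading or image.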

The current head version details:

tesseract 3.04.00
 leptonica-1.71
  libjpeg 8d : libpng 1.6.16 : libtiff 4.0.3 : zlib 1.2.5

Edit

I'm really looking to achieve this using the command-line tool (as in the examples above). @nguyenq has pointed me to the API reference; unfortunately, I have no C++ experience. If the only solution is to use the API, could you please provide a quick Python example?

asked Feb 18 '15 18:02

James Owers

1 Answer

Success. Many thanks to the people at the Pattern Recognition and Image Analysis Research Lab (PRImA) for producing tools to handle this. You can obtain them freely on their website or GitHub.

Below I give the full solution for a Mac running OS X 10.10 and using the Homebrew package manager. I use Wine to run Windows executables.

Overview

1. Download the tools: Tesseract OCR to PAGE (TPT) and PAGE Viewer (PVT)
2. Use the TPT to run Tesseract on your document and convert the hOCR XML to PAGE XML
3. Use the PVT to view the original image with the PAGE XML information overlaid
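Steps 2 and 3 can also be driven from Python. The following is a minimal sketch that only builds the two command lines, using the same executable name and flags as the shell commands in the Code section. The function names and the /Users/Me paths are placeholders, and nothing is executed unless you call run_pipeline yourself.

```python
import shlex
import subprocess
from pathlib import Path


def tesseract_to_page_command(tool_dir, image, out_xml):
    """Build the Wine command for step 2 (flags as used in the Code section)."""
    exe = Path(tool_dir) / "bin" / "PRImA_Tesseract-1-3-78.exe"
    return ["wine", str(exe),
            "-inp-img", str(image),
            "-out-xml", str(out_xml),
            "-rec-mode", "layout"]


def page_viewer_command(page_xml, image):
    """Build the Java command for step 3; it must run from the viewer's directory."""
    return ["java", "-XstartOnFirstThread", "-jar", "JPageViewer.jar",
            str(page_xml), str(image)]


def run_pipeline(tool_dir, viewer_dir, image, out_xml):
    """Run both steps in order; raises CalledProcessError if either tool fails."""
    subprocess.run(tesseract_to_page_command(tool_dir, image, out_xml), check=True)
    subprocess.run(page_viewer_command(out_xml, image), cwd=viewer_dir, check=True)


# Print the step 2 command instead of running it, so this is safe to try anywhere.
# All paths below are placeholders for wherever you put the tools and the image.
downloads = Path("/Users/Me/Downloads")
step2 = tesseract_to_page_command("/Users/Me/Project/tools/TesseractToPAGE 1.3",
                                  downloads / "paper-3.tif",
                                  downloads / "paper-3-tool.xml")
print(shlex.join(step2))
```

Passing the arguments as a list avoids quoting problems with the spaces in the tool directory names.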

Code

brew install wine  # takes a while (more than 10 minutes)
brew install gs    # only for generating a TIFF example; not required, you can use Preview
brew install wget  # only for downloading the example paper; not required, you can do so manually
# dl is your home directory; it is used to build absolute paths below
dl=~
cd "$dl/Downloads"
wget -O paper.pdf "http://www.prima.cse.salford.ac.uk/www/assets/papers/ICDAR2013_Antonacopoulos_HNLA2013.pdf"
# This command can be omitted if you do the conversion to TIFF with Preview
gs                          \
  -o paper-%d.tif           \
  -sDEVICE=tiff24nc         \
  -r300x300                 \
  paper.pdf
# ttptool is the location you downloaded the Tesseract to PAGE tool to
ttptool="/Users/Me/Project/tools/TesseractToPAGE 1.3"
# sudo chmod 777 "$ttptool/bin/PRImA_Tesseract-1-3-78.exe"
touch "$ttptool/log.txt"
wine "$ttptool/bin/PRImA_Tesseract-1-3-78.exe"   \
  -inp-img "$dl/Downloads/paper-3.tif"           \
  -out-xml "$dl/Downloads/paper-3-tool.xml"      \
  -rec-mode layout >> "$ttptool/log.txt"
# pvtool is the location you downloaded the PAGE Viewer tool to
pvtool="/Users/Me/Project/tools/PAGEViewerMacOS_1.1/JPageViewer 1.1 (Mac OS, 64 bit)"
cd "$pvtool"
java -XstartOnFirstThread -jar JPageViewer.jar "$dl/Downloads/paper-3-tool.xml" "$dl/Downloads/paper-3.tif"

Results

Document with overlays (roll over a region to see its text and type): https://i.stack.imgur.com/d90xf.jpg
Overlays alone (use the GUI buttons to toggle): https://i.stack.imgur.com/UrypO.png
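If you want the region types as data rather than in the viewer, the PAGE XML can be read with Python's standard library. This is a minimal sketch under assumptions I have not checked against the tool's actual output: that regions are elements whose names end in "Region", that the label is in a type attribute, and that the outline is either a points attribute on Coords or a list of Point children (the two layouts I am aware of in PAGE schema versions). The sample document and its namespace are made up for illustration.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def read_page_regions(root):
    """Return one dict per region element found under a parsed PAGE XML root."""
    regions = []
    for element in root.iter():
        name = local_name(element.tag)
        if not name.endswith("Region"):
            continue
        coords = next((c for c in element if local_name(c.tag) == "Coords"), None)
        if coords is None:
            continue
        if coords.get("points"):
            # Assumed newer layout: <Coords points="x1,y1 x2,y2 ..."/>
            points = [tuple(int(v) for v in pair.split(","))
                      for pair in coords.get("points").split()]
        else:
            # Assumed older layout: <Coords><Point x=".." y=".."/>...</Coords>
            points = [(int(p.get("x")), int(p.get("y")))
                      for p in coords if local_name(p.tag) == "Point"]
        regions.append({"region": name, "type": element.get("type"),
                        "id": element.get("id"), "points": points})
    return regions


# Hypothetical sample, not real tool output.
sample = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="paper-3.tif" imageWidth="2480" imageHeight="3508">
    <TextRegion id="r0" type="heading">
      <Coords points="100,100 900,100 900,180 100,180"/>
    </TextRegion>
    <ImageRegion id="r1">
      <Coords><Point x="100" y="300"/><Point x="900" y="300"/><Point x="900" y="800"/></Coords>
    </ImageRegion>
  </Page>
</PcGts>"""

for region in read_page_regions(ET.fromstring(sample)):
    print(region["region"], region["type"], region["points"])
```

For a file on disk, ET.parse(path).getroot() gives the root element to pass in.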

Appendix

You can run Tesseract yourself and use another tool to convert its output to PAGE format. I was unable to get this to work, but I'm sure you'll be fine!

# Note that the pvtool does take hOCR XML as input, but it ignores the region type
brew install tesseract --devel  # installs v 3.03 at time of writing
tesseract ~/Downloads/paper-3.tif ~/Downloads/paper-3 hocr
# The page viewer will only open XML files, so rename the output
mv ~/Downloads/paper-3.hocr ~/Downloads/paper-3.xml
# Run this from the pvtool directory, as before
java -XstartOnFirstThread -jar JPageViewer.jar "$dl/Downloads/paper-3.xml"

At this point you need to use the PAGE Converter Java Tool to convert the hOCR XML into PAGE XML. It should go a little something like this:

pctool="/Users/Me/Project/tools/JPageConverter 1.0"
java -jar "$pctool/PageConverter.jar" -source-xml paper-3.xml -target-xml paper-3-hocrconvert.xml -convert-to LATEST

Unfortunately, I kept getting NullPointerExceptions:

Could not convert to target XML schema format.
java.lang.NullPointerException
    at org.primaresearch.dla.page.converter.PageConverter.run(PageConverter.java:126)
    at org.primaresearch.dla.page.converter.PageConverter.main(PageConverter.java:65)
Could not save target PAGE XML file: paper-3-hocrconvert.xml
java.lang.NullPointerException
    at org.primaresearch.dla.page.io.xml.XmlInputOutput.writePage(XmlInputOutput.java:144)
    at org.primaresearch.dla.page.converter.PageConverter.run(PageConverter.java:135)
    at org.primaresearch.dla.page.converter.PageConverter.main(PageConverter.java:65)
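If the conversion keeps failing, one fallback is to draw the hOCR boxes yourself. This is a rough Python sketch, not something provided by Tesseract or the PRImA tools: it turns a list of (label, bbox) pairs into an SVG overlay that a browser can display, with the label shown on rollover. The function name boxes_to_svg, the page size, and the output file name are all made up for illustration, and the background image would need to be in a browser-friendly format such as PNG rather than TIFF.

```python
from html import escape


def boxes_to_svg(boxes, width, height, image_href=None):
    """Render (label, (x0, y0, x1, y1)) pairs as outlined rectangles in an SVG string.

    Coordinates are image pixels, as in hOCR bbox values (top-left origin).
    """
    parts = [f'<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 {width} {height}">']
    if image_href:
        # Optional page image underneath the boxes.
        parts.append(f'<image href="{escape(image_href)}" '
                     f'width="{width}" height="{height}"/>')
    for label, (x0, y0, x1, y1) in boxes:
        parts.append(
            f'<rect x="{x0}" y="{y0}" width="{x1 - x0}" height="{y1 - y0}" '
            f'fill="none" stroke="red" stroke-width="4">'
            f'<title>{escape(label)}</title></rect>'
        )
    parts.append("</svg>")
    return "\n".join(parts)


# The bbox is the ocr_par example from the question; the 4000 x 5000 page size
# is a placeholder and should be replaced by the real image dimensions.
svg = boxes_to_svg([("ocr_par", (2194, 4490, 3842, 4589))], 4000, 5000)
with open("overlay.svg", "w", encoding="utf-8") as handle:
    handle.write(svg)
```

As noted above, hOCR only carries the ocr_* class names, so an overlay built this way shows the boxes but not the region types that the PAGE XML route provides.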

answered Oct 06 '22 02:10

James Owers
