Get text position with tesseract 2.04 and Java

Question

I'm performing OCR on some images with Tesseract 2.04, and now I need the precise position of the recognized text. This version doesn't return that information.

I need this to generate a searchable PDF file. I already know how to stamp text in a layer underneath the PDF content, but I need the position at which to stamp it. My first idea is to run OCR on the PDF, get the text and its position, and then stamp the text into the PDF with the iText API.

Jake Frederix · Accepted Answer

Internally at iText we have also looked into OCR, and it is possible (using Tesseract).

Workflow:

Extract all images from the PDF using iText.
Extract the text (and coordinates, font, etc.) using Tesseract.
Apply coordinate transformations, since the Tesseract coordinate system and the iText coordinate system are not the same.
Add a layer to the PDF (canvas.beginLayer).
Draw all the text in this layer at the correct position.
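
The coordinate transformation step can be sketched roughly as follows. This is only an illustration under stated assumptions: the OCR engine reports word boxes in image pixels with the origin at the top-left corner and y growing downward (as hOCR-style output does), the scanned image covers the whole PDF page, and the scan resolution (DPI) is known. PDF user space has its origin at the bottom-left corner, y growing upward, and 72 units per inch. The names OcrToPdfCoordinates, Box and toPdfBox are hypothetical and are not part of Tesseract or iText.

```java
public class OcrToPdfCoordinates {

    /** A rectangle in PDF user space (points), described by its lower-left corner and size. */
    public static final class Box {
        public final double left;
        public final double bottom;
        public final double width;
        public final double height;

        public Box(double left, double bottom, double width, double height) {
            this.left = left;
            this.bottom = bottom;
            this.width = width;
            this.height = height;
        }
    }

    /**
     * Converts a pixel box (x0, y0) to (x1, y1), measured from the top-left corner
     * of the image, into PDF points measured from the bottom-left corner of the page.
     * Assumption: the image fills the page and was scanned at the given DPI.
     */
    public static Box toPdfBox(int x0, int y0, int x1, int y1,
                               int imageHeightPx, double dpi) {
        double scale = 72.0 / dpi;                     // pixels -> points
        double left = x0 * scale;
        double bottom = (imageHeightPx - y1) * scale;  // flip the y axis
        double width = (x1 - x0) * scale;
        double height = (y1 - y0) * scale;
        return new Box(left, bottom, width, height);
    }

    public static void main(String[] args) {
        // A word box on a 3300 px tall scan (an 11 inch page at 300 DPI).
        Box b = toPdfBox(300, 150, 900, 210, 3300, 300.0);
        System.out.println("left=" + b.left + " bottom=" + b.bottom
                + " width=" + b.width + " height=" + b.height);
    }
}
```

In this example the box lands about 72 pt from the left edge and 741.6 pt above the bottom of a 792 pt tall page, with a size of roughly 144 by 14.4 pt. If the image is placed on the page with an offset or a different scale, that placement has to be applied as well.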

There are many more optimizations you could do. A short list of suggestions:

correct baseline
correct font
correct spelling mistakes
estimate color
estimate background color
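
As one example of these optimizations, the background color behind a word could be estimated by taking the most frequent pixel color inside its bounding box. The sketch below is a crude heuristic and not something the workflow above prescribes: it assumes the region is given in image pixels and that the background covers more pixels than the glyphs do. The names RegionColorEstimator and dominantRgb are hypothetical.

```java
import java.awt.image.BufferedImage;
import java.util.HashMap;
import java.util.Map;

public class RegionColorEstimator {

    /**
     * Returns the most frequent color (as 0xRRGGBB) inside the given pixel rectangle.
     * Assumption: the rectangle lies fully inside the image and is not empty.
     */
    public static int dominantRgb(BufferedImage image, int x, int y, int width, int height) {
        Map<Integer, Integer> counts = new HashMap<>();
        int best = 0;
        int bestCount = 0;
        for (int row = y; row < y + height; row++) {
            for (int col = x; col < x + width; col++) {
                int rgb = image.getRGB(col, row) & 0xFFFFFF; // drop the alpha channel
                int count = counts.merge(rgb, 1, Integer::sum);
                if (count > bestCount) {
                    bestCount = count;
                    best = rgb;
                }
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Synthetic "scan": a pale background with a short dark stroke on it.
        BufferedImage image = new BufferedImage(20, 10, BufferedImage.TYPE_INT_RGB);
        for (int row = 0; row < 10; row++) {
            for (int col = 0; col < 20; col++) {
                image.setRGB(col, row, 0xFFFFEE);
            }
        }
        for (int col = 5; col < 8; col++) {
            image.setRGB(col, 4, 0x101010);
        }
        System.out.println(String.format("#%06X", dominantRgb(image, 0, 0, 20, 10)));
    }
}
```

Real scans contain noise and anti-aliased edges, so grouping similar colors into buckets before counting would probably give more stable results than counting exact values.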

This is not an easy task, but it is certainly possible.

Tags: java, pdf, ocr, tesseract, itext

Raduan Santos
