Tesseract 3 is able to perform page layout analysis. However, I couldn't find any sample code or documentation on how to use the library for such purposes. I hope someone here can explain how to perform layout analysis on an image and how to parse the resulting data.
Since Tesseract 3.04 there is a preserve_interword_spaces option, which keeps the spacing between words in the text output:
tesseract -c preserve_interword_spaces=1 test.tif test
Here is a reference to what looks like the related development thread.
Tesseract can be given a page mode parameter (-psm), which can have the following values:
0 = Orientation and script detection (OSD) only.
1 = Automatic page segmentation with OSD.
2 = Automatic page segmentation, but no OSD, or OCR
3 = Fully automatic page segmentation, but no OSD. (Default)
4 = Assume a single column of text of variable sizes.
5 = Assume a single uniform block of vertically aligned text.
6 = Assume a single uniform block of text.
7 = Treat the image as a single text line.
8 = Treat the image as a single word.
9 = Treat the image as a single word in a circle.
10 = Treat the image as a single character.
Example:
tesseract image.tif image.txt -l eng -psm 0
However, I am not sure that it is possible to use the layout analysis in standalone mode.
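On parsing the resulting data: one possible route, not confirmed anywhere in this thread, is to have Tesseract write hOCR output and read the bounding boxes from it. The sketch below assumes the hOCR file comes from a command such as tesseract image.tif out hocr, and that it uses the usual hOCR class names (ocr_page, ocr_carea, ocr_par, ocr_line, ocrx_word) with a "bbox x0 y0 x1 y1" entry in each title attribute. The names HocrLayoutParser, parse_bbox, parse_hocr and the SAMPLE text are made up for illustration. It only uses the Python standard library.

```python
from html.parser import HTMLParser

# hOCR class names assumed for layout elements, outermost first.
LAYOUT_CLASSES = ("ocr_page", "ocr_carea", "ocr_par", "ocr_line", "ocrx_word")


def parse_bbox(title):
    """Return (x0, y0, x1, y1) from an hOCR title attribute, or None."""
    # A title looks like: 'bbox 36 92 96 116; x_wconf 90'
    for prop in title.split(";"):
        parts = prop.split()
        if len(parts) == 5 and parts[0] == "bbox":
            return tuple(int(p) for p in parts[1:])
    return None


class HocrLayoutParser(HTMLParser):
    """Collect (class_name, bbox) for every layout element, in document order."""

    def __init__(self):
        super().__init__()
        self.elements = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        classes = (attr.get("class") or "").split()
        kind = next((c for c in classes if c in LAYOUT_CLASSES), None)
        if kind is None:
            return
        bbox = parse_bbox(attr.get("title") or "")
        if bbox is not None:
            self.elements.append((kind, bbox))


def parse_hocr(text):
    """Parse hOCR markup and return a list of (class_name, bbox) tuples."""
    parser = HocrLayoutParser()
    parser.feed(text)
    return parser.elements


# Hand-written sample in the shape of hOCR output (hypothetical data).
SAMPLE = """
<div class='ocr_page' id='page_1' title='image "test.tif"; bbox 0 0 640 480; ppageno 0'>
 <div class='ocr_carea' id='block_1_1' title="bbox 36 92 580 150">
  <p class='ocr_par' id='par_1_1' title="bbox 36 92 580 150">
   <span class='ocr_line' id='line_1_1' title="bbox 36 92 580 116">
    <span class='ocrx_word' id='word_1_1' title='bbox 36 92 96 116; x_wconf 90'>Hello</span>
   </span>
  </p>
 </div>
</div>
"""

if __name__ == "__main__":
    for kind, bbox in parse_hocr(SAMPLE):
        print(kind, bbox)
```

To use it on real output, read the file Tesseract wrote and pass its contents to parse_hocr. The file extension differs between versions (out.html or out.hocr), so check which one your build produces. The blocks (ocr_carea) and paragraphs (ocr_par) are the part that reflects the layout analysis; lines and words are nested inside them.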