I am using Tesseract OCR to convert scanned PDFs into plain text. Overall it is highly effective, but I am having issues with the order in which the text is read. Documents with tabular data seem to be scanned down, column by column, when the more natural order would be row by row. A small example:
This is column A, row 1 This is column B, row 1 This is column C, row 1
This is column A, row 2 This is column B, row 2 This is column C, row 2
yields the following text:
This is column A, row 1
This is column A, row 2
This is column B, row 1
This is column B, row 2
This is column C, row 1
This is column C, row 2
I have started reading the documentation and taking a guess-and-test, brute-force approach with the parameters documented here, but if someone has already tackled a similar issue, I would appreciate insight into the fix. It could also be a matter of training data, but I do not know exactly how that works.
Try running tesseract in one of the single column Page Segmentation Modes:
tesseract input.tif output-filename --psm 6
By default, Tesseract expects a full page of text when it segments an image. If you're only trying to OCR a small region, try a different segmentation mode using the --psm argument (written -psm in Tesseract 3.x). Note that adding a white border to text that is too tightly cropped may also help; see issue 398. To see the complete list of supported page segmentation modes, use tesseract -h. Here's the [ed: excerpt only] list as of 3.21:
3 - Fully automatic page segmentation, but no OSD. (Default)
4 - Assume a single column of text of variable sizes.
5 - Assume a single uniform block of vertically aligned text.
6 - Assume a single uniform block of text.
See examples here: #using-different-page-segmentation-modes
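If you're unsure which mode fits a particular document, it can help to run the same image through a few candidate modes and compare the output side by side. A minimal sketch of that loop, assuming the tesseract binary is on your PATH and using "input.tif" as a placeholder file name:

```python
import shutil
import subprocess

def tesseract_cmd(image_path: str, psm: int) -> list[str]:
    """Build a tesseract command line that prints OCR text to stdout."""
    return ["tesseract", image_path, "stdout", "--psm", str(psm)]

def try_psm_modes(image_path: str, modes=(3, 4, 6)) -> dict:
    """Run tesseract once per page segmentation mode; return text keyed by mode."""
    results = {}
    for psm in modes:
        out = subprocess.run(
            tesseract_cmd(image_path, psm),
            capture_output=True, text=True, check=True,
        )
        results[psm] = out.stdout
    return results

if __name__ == "__main__":
    # Only attempt OCR if tesseract is actually installed.
    if shutil.which("tesseract"):
        for psm, text in try_psm_modes("input.tif").items():
            print(f"--- psm {psm} ---\n{text}")
```

For tabular data like the example above, --psm 6 (single uniform block) is often the one that preserves row order, but it depends on the layout, so comparing a few modes is usually faster than reasoning it out.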
I know this is an old question, but I've been struggling with a similar issue and found hOCR output to be the solution. Running
tesseract input.tif output-filename hocr
will create output-filename.hocr
(basically HTML) that gives coordinates for the bounding boxes of each phrase. It's up to you to determine how to reconstruct the table from this data (probably using the dimensions of the input image).
As in the other answers, specifying some particular page segmentation mode might be useful in getting the phrases of your table grouped appropriately, but the coordinates will provide the precise result needed.
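To illustrate the reconstruction step, here is a sketch that groups hOCR words into rows by their bounding-box y coordinates and then sorts each row left to right. It uses a regex for brevity (a real parser should walk the HTML properly), and the row_tolerance value is an assumption you would tune to your image's resolution:

```python
import re

def words_from_hocr(hocr: str):
    """Extract (text, x, y) for each ocrx_word span in an hOCR string.

    Relies on the standard hOCR title attribute 'bbox x0 y0 x1 y1'.
    """
    pattern = re.compile(
        r"<span[^>]*class=[\"']ocrx_word[\"'][^>]*"
        r"title=[\"']bbox (\d+) (\d+) (\d+) (\d+)[^\"']*[\"'][^>]*>([^<]+)</span>"
    )
    return [
        (text.strip(), int(x0), int(y0))
        for x0, y0, x1, y1, text in pattern.findall(hocr)
    ]

def reading_order(words, row_tolerance=10):
    """Group words into rows by y coordinate, then sort each row by x."""
    rows = []  # list of (row_y, [(x, text), ...])
    for text, x, y in sorted(words, key=lambda w: w[2]):
        for row in rows:
            if abs(row[0] - y) <= row_tolerance:  # same visual line
                row[1].append((x, text))
                break
        else:
            rows.append((y, [(x, text)]))
    # Within each row, emit words left to right.
    return [" ".join(t for _, t in sorted(row[1])) for row in rows]
```

With this approach, even if Tesseract emits the words column by column, the bounding boxes let you rebuild the row-by-row reading order yourself.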