
Tesseract OCR text order for documents with tables or rows

Tags: ocr, tesseract

I am using Tesseract OCR to convert scanned PDFs into plain text. Overall it is highly effective, but I am having issues with the order in which the text is output. Documents with tabular data come out column by column, when the more natural order would be row by row. A very small-scale example would be:

This is column A, row 1   This is column B, row 1    This is column C, row 1
This is column A, row 2   This is column B, row 2    This is column C, row 2

This yields the following text:

This is column A, row 1
This is column A, row 2
This is column B, row 1
This is column B, row 2
This is column C, row 1
This is column C, row 2

I am starting to read the documentation and take a guess-and-test, brute-force approach with the parameters documented here, but if someone has already tackled a similar issue, I would appreciate any insight on the fix. It might also come down to training data, but I do not know exactly how that works.

derdc asked Mar 16 '15 22:03


2 Answers

Try running tesseract in one of the single-column page segmentation modes:

tesseract input.tif output-filename --psm 6

By default, Tesseract expects a full page of text when it segments an image. If you only want to OCR a small region, try a different segmentation mode using the --psm argument (spelled -psm in Tesseract 3.x). Adding a white border around text that is too tightly cropped may also help; see issue 398.

To see a complete list of supported page segmentation modes, use tesseract -h. Here is an excerpt of the list as of 3.21:

  3. Fully automatic page segmentation, but no OSD. (Default)
  4. Assume a single column of text of variable sizes.
  5. Assume a single uniform block of vertically aligned text.
  6. Assume a single uniform block of text.

See examples here: #using-different-page-segmentation-modes
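If you want to compare a few modes side by side, a small Python wrapper can build one command per mode. This is only a sketch: build_tesseract_cmd and run_modes are hypothetical helper names, the output base names are made up, and it assumes the tesseract binary is on your PATH. The flag is --psm in Tesseract 4 and later and -psm in 3.x, hence the legacy_flag switch.

```python
import subprocess


def build_tesseract_cmd(image, output_base, psm=6, legacy_flag=False, configs=()):
    """Return the argument list for one tesseract run.

    legacy_flag=True emits "-psm" (Tesseract 3.x) instead of "--psm".
    configs are config names appended at the end, e.g. ("hocr",).
    """
    flag = "-psm" if legacy_flag else "--psm"
    return ["tesseract", image, output_base, flag, str(psm), *configs]


def run_modes(image, modes=(3, 4, 6)):
    """Run tesseract once per mode, writing output-psm<N>.txt for each.

    Hypothetical helper: requires tesseract to be installed and the image to exist.
    """
    for psm in modes:
        cmd = build_tesseract_cmd(image, f"output-psm{psm}", psm=psm)
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Only print the commands here; call run_modes("input.tif") to execute them.
    for mode in (3, 4, 6):
        print(" ".join(build_tesseract_cmd("input.tif", f"output-psm{mode}", psm=mode)))
```

Diffing the resulting text files is a quick way to see which mode keeps your rows together.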

ptim answered Dec 27 '22 13:12


I know this is an old question, but I've been struggling with a similar issue and found hOCR output to be the solution. Running

tesseract input.tif output-filename hocr

will create output-filename.hocr (basically HTML), which gives the bounding-box coordinates of each phrase. It's up to you to determine how to reconstruct the table from this data (probably using the dimensions of the input image).

As the other answer suggests, specifying a particular page segmentation mode may help group the phrases of your table appropriately, but the coordinates give you the precise positions needed to rebuild the layout.
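One way to use those bounding boxes, sketched in Python with only the standard library: collect every word span, group words whose vertical centres are close together into a row, then sort each row left to right. Treat this as a starting point rather than a finished solution. The SAMPLE markup is hand-written to imitate hOCR (ocrx_word spans with a "bbox x0 y0 x1 y1" title), the names WordCollector and rows_from_hocr are my own, and the 10-pixel tolerance is a guess you would tune to your scan resolution.

```python
import re
from html.parser import HTMLParser

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")

# Hand-written stand-in for tesseract's hOCR output, in column-major order.
SAMPLE = """
<div class='ocr_page' title='bbox 0 0 600 100'>
  <span class='ocrx_word' title='bbox 10 10 40 30; x_wconf 95'>A1</span>
  <span class='ocrx_word' title='bbox 10 50 40 70; x_wconf 95'>A2</span>
  <span class='ocrx_word' title='bbox 210 12 240 32; x_wconf 94'>B1</span>
  <span class='ocrx_word' title='bbox 210 51 240 71; x_wconf 93'>B2</span>
  <span class='ocrx_word' title='bbox 410 9 440 29; x_wconf 96'>C1</span>
  <span class='ocrx_word' title='bbox 410 50 440 70; x_wconf 96'>C2</span>
</div>
"""


class WordCollector(HTMLParser):
    """Collect (x0, y0, x1, y1, text) for every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # bbox of the word span currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and "ocrx_word" in (attrs.get("class") or "").split():
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(v) for v in match.groups())
                self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._bbox is not None:
            text = "".join(self._text).strip()
            if text:
                self.words.append((*self._bbox, text))
            self._bbox = None


def rows_from_hocr(hocr, tolerance=10):
    """Return one string per visual row, words ordered left to right."""
    parser = WordCollector()
    parser.feed(hocr)
    # Sort by vertical centre so words on the same row end up adjacent.
    words = sorted(parser.words, key=lambda w: (w[1] + w[3]) / 2)
    rows, row_center = [], None
    for word in words:
        center = (word[1] + word[3]) / 2
        # Start a new row when this word sits well below the row's first word.
        if row_center is None or center - row_center > tolerance:
            rows.append([])
            row_center = center
        rows[-1].append(word)
    return [" ".join(w[4] for w in sorted(row, key=lambda w: w[0])) for row in rows]


if __name__ == "__main__":
    print("\n".join(rows_from_hocr(SAMPLE)))
```

On the sample above this prints "A1 B1 C1" and then "A2 B2 C2". For a real file, read output-filename.hocr into a string and pass it to rows_from_hocr; skewed scans or multi-line cells will need something smarter than a fixed tolerance.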

R. Shafer answered Dec 27 '22 12:12
