
How to preserve all whitespace from an image when extracting text with Tesseract 4.0?

I am extracting tabular text from images using tesseract-ocr 4.0 and exporting the results to Excel while maintaining the alignment of the data.

I want to keep all the spaces in the extracted table exactly as they are in the image, but the OCR drops many leading and trailing spaces.

My images contain tables in which some cells are empty. I have used the preserve_interword_spaces option in Tesseract, but the OCR still skips many of these empty spaces.

Is there a way to detect or preserve all the empty spaces in the table during OCR extraction? Or is there a technique for detecting empty cells by measuring distances in the table?

I have attached a sample image:

[image]

Asked by Apurva Kumar Singh, Nov 19 '25 09:11



1 Answer

I think you should upgrade to Tesseract 5 and use "-c preserve_interword_spaces=1" to preserve whitespace. You will probably still need some post-processing, because the output may not match the layout of the image exactly.

EDITED

Your question is similar to this one. Since I could not use that answer directly, I modified it slightly. Credit goes to igrinis.

import cv2
import pytesseract
from pytesseract import Output
import pandas as pd

# read as grayscale (cv2.COLOR_BGR2GRAY is a cvtColor code, not an imread flag)
img = cv2.imread("bsShN.jpg", cv2.IMREAD_GRAYSCALE)
gauss = cv2.GaussianBlur(img, (3, 3), 0)

custom_config = r' -l eng --oem 1 --psm 6  -c preserve_interword_spaces=1 -c tessedit_char_whitelist="0123456789- " '
d = pytesseract.image_to_data(gauss, config=custom_config, output_type=Output.DICT)
df = pd.DataFrame(d)

# clean up blanks; conf is a string in some pytesseract versions and a number in others
df1 = df[(df.conf.astype(float) != -1) & (df.text.str.strip() != '')]

# sort blocks vertically
sorted_blocks = df1.groupby('block_num').first().sort_values('top').index.tolist()
for block in sorted_blocks:
    curr = df1[df1['block_num'] == block]
    sel = curr[curr.text.str.len() > 3]
    char_w = (sel.width / sel.text.str.len()).mean()
    prev_par, prev_line, prev_left = 0, 0, 0
    text = ''
    for ix, ln in curr.iterrows():
        # add new line when necessary
        if prev_par != ln['par_num']:
            text += '\n'
            prev_par = ln['par_num']
            prev_line = ln['line_num']
            prev_left = 0
        elif prev_line != ln['line_num']:
            text += '\n'
            prev_line = ln['line_num']
            prev_left = 0

        added = 0  # num of spaces that should be added
        if ln['left'] / char_w > prev_left + 1:
            added = int((ln['left']) / char_w) - prev_left
            text += ' ' * added
        text += ln['text'] + ' '
        prev_left += len(ln['text']) + added + 1
    text += '\n'
    print(text)

Here is the output. Not all digits are recognized correctly, and the spacing still needs to be fixed in some places.

  56     0   232                  35                                    197    19363 
   0     3    22                  10                                     12     1586 
  60  200    165                   0                                    165    11626 
  44  345     69     50    610    75                                     54     7593 
  52  789    191    480    96    618                                    149     6324 
  84    71    34     50    8610   20                                     74     4837 
  77    680- 131                  61                      1              71     3000 
  11     6   103                   0                                    103     9932 
   2    52    29                   3                                     26     4451 
  12    65    23                   4                                     19     1626 
  24    62          100           10                              -1     90     6621 
497   897     63    360          292        100     0                    31     3056 
863  1285    331     50          197         50     0                   134    17037 
   0     5    24                   2                                     22     3159 
  15   131   144                  47                                     97    15070 
  44    61    86     44     4    112                                     22     1320 
  10    90    85     50          135                                      0        0 
   3     8    54                  11                              -9     43     2334
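
One possible way to fix the remaining spacing, and to answer the "distance measurement" part of the question, is to stop converting pixels to spaces and instead assign each word to a column from its pixel position. The sketch below is only an illustration under some assumptions: the table is right-aligned (as numeric tables usually are), so the right edge (left + width) of each word is clustered into columns. The function names, the 'line' key and the tol threshold are hypothetical; the 'left', 'width' and 'text' keys follow the image_to_data output used above, and 'line' could be built from (block_num, par_num, line_num).

```python
from collections import defaultdict


def cluster_edges(edges, tol):
    """Group pixel positions; a gap larger than tol starts a new column.

    Returns the mean position of each group, in left-to-right order.
    """
    groups = []
    for e in sorted(edges):
        if groups and e - groups[-1][-1] <= tol:
            groups[-1].append(e)
        else:
            groups.append([e])
    return [sum(g) / len(g) for g in groups]


def words_to_grid(words, tol=15):
    """Place words into a row/column grid; cells with no word stay ''.

    words: iterable of dicts with 'line', 'left', 'width', 'text' (pixels).
    tol:   assumed maximum jitter of a column's right edge, in pixels.
    """
    words = list(words)
    columns = cluster_edges([w['left'] + w['width'] for w in words], tol)
    rows = defaultdict(lambda: [''] * len(columns))
    for w in words:
        right = w['left'] + w['width']
        # nearest column by right-edge distance
        col = min(range(len(columns)), key=lambda i: abs(columns[i] - right))
        cell = rows[w['line']][col]
        # if two words fall into the same cell, keep both
        rows[w['line']][col] = f"{cell} {w['text']}".strip()
    return [rows[k] for k in sorted(rows)]


# made-up boxes: the second line has no word in the middle column
sample = [
    {'line': 1, 'left': 10, 'width': 20, 'text': '56'},
    {'line': 1, 'left': 60, 'width': 10, 'text': '0'},
    {'line': 1, 'left': 100, 'width': 30, 'text': '232'},
    {'line': 2, 'left': 20, 'width': 10, 'text': '0'},
    {'line': 2, 'left': 102, 'width': 30, 'text': '22'},
]
grid = words_to_grid(sample)
for row in grid:
    print(row)
```

Because every row has the same number of cells, the result could be passed to pandas.DataFrame(grid) and written out with to_excel, which keeps empty cells empty instead of relying on space counts. The tol value would need tuning to the image resolution.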
Answered by us2018, Nov 20 '25 22:11
