How to get confidence of each line using pytesseract

I have successfully set up Tesseract and can convert images to text...

text = pytesseract.image_to_string(Image.open(image))

However, I need to get the confidence value for every line, and I cannot find a way to do this using pytesseract. Does anyone know how to do this?

I know this is possible using PyTessBaseAPI, but I cannot use that: I've spent hours trying to set it up with no luck, so I need a way to do this using pytesseract.

buydadip asked Mar 28 '19 21:03


2 Answers

After much searching, I have figured out a way. Instead of image_to_string, one should use image_to_data. However, this will give you statistics for each word, not each line...

text = pytesseract.image_to_data(Image.open(file_image), output_type='data.frame')

So I saved the output as a dataframe and then used pandas to group by block_num, since the OCR groups each line into a block. I also removed all rows with no confidence value (-1)...

text = text[text.conf != -1]
lines = text.groupby('block_num')['text'].apply(list)

Using this same logic, you can also calculate the confidence per line by calculating the mean confidence of all words within the same block...

conf = text.groupby(['block_num'])['conf'].mean()
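
To see what the filtering and grouping do without running Tesseract, here is a minimal sketch on a hand-made dataframe. The column names match those returned by image_to_data, but the rows and values are invented for illustration, and only the three columns used here are included.

```python
import pandas as pd

# Hypothetical rows shaped like image_to_data output (values are made up).
# Rows with conf == -1 are structural entries (block, line, ...) that carry no word.
data = pd.DataFrame({
    'block_num': [1, 1, 1, 2, 2],
    'conf':      [-1, 96, 90, -1, 80],
    'text':      [None, 'Hello', 'world', None, 'Bye'],
})

words = data[data.conf != -1]                           # keep only real words
lines = words.groupby('block_num')['text'].apply(list)  # words per block
conf = words.groupby('block_num')['conf'].mean()        # mean confidence per block

print(lines.to_dict())
print(conf.to_dict())
```

Note that this averages over a whole block, so it matches per-line confidence only when each block happens to contain a single line.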
buydadip answered Oct 31 '22 11:10


@Srikar Appalaraju is right. Take the following example image:

Now use the following code:

text = pytesseract.image_to_data(gray, output_type='data.frame')
text = text[text.conf != -1]
text.head()

Notice that all five rows returned by text.head() have the same block_num, so grouping by that column alone puts all five words (texts) together. That is not what we want: only the first three words belong to the first line. To group words into lines properly (in a generic manner, for a large enough image), we need to group by all four columns page_num, block_num, par_num and line_num simultaneously and compute the mean confidence per group, as shown in the following code snippet:

# A line is identified by its page, block, paragraph and line numbers
keys = ['page_num', 'block_num', 'par_num', 'line_num']
grouped = text.groupby(keys)

# Join the words of each line and average their confidences
lines = grouped['text'].apply(lambda x: ' '.join(list(x))).tolist()
confs = grouped['conf'].mean().tolist()

# Keep only non-empty lines, paired with their rounded mean confidence
line_conf = [(line, round(conf, 3)) for line, conf in zip(lines, confs) if line.strip()]

with the following desired output:

[('Ying Thai Kitchen', 91.667),
 ('2220 Queen Anne AVE N', 88.2),
 ('Seattle WA 98109', 90.333),
 ('« (206) 285-8424 Fax. (206) 285-8427', 83.167),
 ('‘uw .yingthaikitchen.com', 40.0),
 ('Welcome to Ying Thai Kitchen Restaurant,', 85.333),
 ('Order#:17 Table 2', 94.0),
 ('Date: 7/4/2013 7:28 PM', 86.25),
 ('Server: Jack (1.4)', 83.0),
 ('44 Ginger Lover $9.50', 89.0),
 ('[Pork] [24#]', 43.0),
 ('Brown Rice $2.00', 95.333),
 ('Total 2 iten(s) $11.50', 89.5),
 ('Sales Tax $1.09', 95.667),
 ('Grand Total $12.59', 95.0),
 ('Tip Guide', 95.0),
 ('TEK=$1.89, 18%=62.27, 20%=82.52', 6.667),
 ('Thank you very much,', 90.75),
 ('Cone back again', 92.667)]
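
The difference between grouping by block_num alone and by all four keys can be checked on a small hand-made dataframe. This is only a sketch: line_confidences is a hypothetical helper name, and the rows below are invented, not real Tesseract output.

```python
import pandas as pd

LINE_KEYS = ['page_num', 'block_num', 'par_num', 'line_num']

def line_confidences(df):
    """Return (line text, mean word confidence) pairs, one per OCR line.

    Hypothetical helper; expects the columns produced by image_to_data.
    """
    words = df[df.conf != -1]              # drop rows that carry no word
    grouped = words.groupby(LINE_KEYS)
    # astype(str) guards against cells that pandas parsed as non-strings
    lines = grouped['text'].apply(lambda x: ' '.join(x.astype(str)))
    confs = grouped['conf'].mean().round(3)
    return [(t, c) for t, c in zip(lines, confs) if t.strip()]

# Invented rows: a single block that contains two lines.
sample = pd.DataFrame({
    'page_num':  [1, 1, 1, 1, 1],
    'block_num': [1, 1, 1, 1, 1],
    'par_num':   [1, 1, 1, 1, 1],
    'line_num':  [1, 1, 1, 2, 2],
    'conf':      [90, 95, 91, 80, 70],
    'text':      ['Ying', 'Thai', 'Kitchen', 'Seattle', 'WA'],
})

result = line_confidences(sample)
print(result)

# Grouping by block_num alone merges both lines into one entry.
block_means = sample.groupby('block_num')['conf'].mean().tolist()
print(block_means)
```

With this sample the helper yields two entries, one per line, while the block-level grouping yields a single averaged value for both lines.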
Sandipan Dey answered Oct 31 '22 09:10