
Can I do a "string contains X" with a percentage accuracy in python?

I need to run OCR on a large chunk of text and check whether it contains a certain string. Because the OCR is inaccurate, the check needs to accept something like an ~85% match for the string.

For example, I may OCR a chunk of text to make sure it doesn't contain "no information available", but the OCR might read "n0 inf0rmation available" or misinterpret any number of characters.

Is there an easy way to do this in Python?

Asked Jun 01 '12 at 11:06 by Jacxel




1 Answer

As posted by gauden, SequenceMatcher in difflib is an easy way to go. Its ratio() method returns a value between 0 and 1 measuring the similarity of the two strings. From the docs:

Where T is the total number of elements in both sequences, and M is the number of matches, this is 2.0*M / T. Note that this is 1.0 if the sequences are identical, and 0.0 if they have nothing in common.

Example:

>>> import difflib
>>> difflib.SequenceMatcher(None, 'no information available', 'n0 inf0rmation available').ratio()
0.9166666666666666
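
To see where that number comes from, here is a small sketch that recomputes the ratio by hand from the formula quoted above. It uses SequenceMatcher.get_matching_blocks() to count the matched characters; the variable names (m, t, manual_ratio) are mine, chosen only for illustration.

```python
import difflib

a = 'no information available'
b = 'n0 inf0rmation available'
sm = difflib.SequenceMatcher(None, a, b)

# M: number of matched characters, summed over all matching blocks.
# The last block is a zero-length sentinel, so it adds nothing.
m = sum(block.size for block in sm.get_matching_blocks())

# T: total number of characters in both strings.
t = len(a) + len(b)

manual_ratio = 2.0 * m / t
print(m, t, manual_ratio)
```

Here 22 of the 24 characters in each string match (only the two o/0 swaps differ), so M is 22, T is 48, and 2.0*M / T gives the same value as ratio().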

There is also get_close_matches, which might be useful to you. You can specify a similarity cutoff (a ratio between 0 and 1, not a distance), and it will return the candidates from a list that score at least that high, best match first:

>>> difflib.get_close_matches('unicorn', ['unicycle', 'uncorn', 'corny',
...                               'house'], cutoff=0.8)
['uncorn']
>>> difflib.get_close_matches('unicorn', ['unicycle', 'uncorn', 'corny',
...                               'house'], cutoff=0.5)
['uncorn', 'corny', 'unicycle']
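
If it helps when choosing a cutoff, the following sketch prints the score of each candidate so you can see which ones clear 0.8 and which only clear 0.5. It assumes the same word and candidate list as above; the scores dictionary is just an illustrative helper. Note that get_close_matches also has an n parameter (default 3) that caps how many matches are returned.

```python
import difflib

word = 'unicorn'
candidates = ['unicycle', 'uncorn', 'corny', 'house']

# Score each candidate against the word with SequenceMatcher.ratio(),
# the same measure that get_close_matches compares against the cutoff.
scores = {c: difflib.SequenceMatcher(None, c, word).ratio() for c in candidates}

# Print candidates from best to worst match.
for candidate in sorted(candidates, key=scores.get, reverse=True):
    print('%-10s %.3f' % (candidate, scores[candidate]))
```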

Update: to find a partial sub-sequence match

To find close matches to a three-word sequence, I would split the text into words, group them into overlapping three-word sequences, and then apply difflib.get_close_matches, like this:

import difflib

text = ("Here is the text we are trying to match across to find the three word "
        "sequence n0 inf0rmation available I wonder if we will find it?")
words = text.split()
# Build every run of three consecutive words
three = [' '.join([i, j, k]) for i, j, k in zip(words, words[1:], words[2:])]
print(difflib.get_close_matches('no information available', three, cutoff=0.9))
# Output:
# ['n0 inf0rmation available']
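
The same idea can be generalized to a phrase of any length and to the ~85% threshold from the question. The sketch below is one possible way to do it, not a library function: fuzzy_contains and its threshold parameter are hypothetical names, and it assumes the OCR keeps word boundaries intact (if the OCR splits or merges words, a fixed-size word window will miss the match).

```python
import difflib

def fuzzy_contains(text, phrase, threshold=0.85):
    """Return (best_window, score) if some window of words in text matches
    phrase with ratio >= threshold, otherwise None."""
    words = text.split()
    n = len(phrase.split())
    # Every run of n consecutive words in the text.
    windows = [' '.join(words[i:i + n]) for i in range(len(words) - n + 1)]
    best, best_score = None, 0.0
    for window in windows:
        score = difflib.SequenceMatcher(None, phrase, window).ratio()
        if score > best_score:
            best, best_score = window, score
    if best_score >= threshold:
        return best, best_score
    return None

text = ("Here is the text we are trying to match across to find the three word "
        "sequence n0 inf0rmation available I wonder if we will find it?")
result = fuzzy_contains(text, 'no information available')
print(result)
```

Returning the matched window together with its score, rather than a bare True/False, makes it easier to inspect borderline cases while tuning the threshold.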
Answered Oct 03 '22 at 00:10 by fraxel