I need to run OCR on a large chunk of text and check whether it contains a certain string. Because the OCR is inaccurate, I need to check whether the text contains something like a ~85% match for that string.
For example, I may OCR a chunk of text to make sure it doesn't contain 'no information available'
but the OCR might see 'n0 inf0rmation available'
or misinterpret any number of characters.
Is there an easy way to do this in Python?
As posted by gauden, SequenceMatcher in difflib is an easy way to go. Its ratio() method returns a value between 0 and 1 corresponding to the similarity between the two strings. From the docs:
Where T is the total number of elements in both sequences, and M is the number of matches, this is 2.0*M / T. Note that this is 1.0 if the sequences are identical, and 0.0 if they have nothing in common.
example:
>>> import difflib
>>> difflib.SequenceMatcher(None,'no information available','n0 inf0rmation available').ratio()
0.91666666666666663
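To see where that figure comes from, here is a small sketch that recomputes 2.0*M / T by hand for the same two strings. It assumes that M can be obtained by summing the sizes of the blocks returned by get_matching_blocks(); the variable names are only illustrative.

```python
import difflib

a = 'no information available'
b = 'n0 inf0rmation available'

matcher = difflib.SequenceMatcher(None, a, b)

# get_matching_blocks() yields (a, b, size) triples describing matching runs;
# the final triple is a zero-length sentinel, so it adds nothing to the sum.
blocks = matcher.get_matching_blocks()
M = sum(block.size for block in blocks)  # number of matching characters
T = len(a) + len(b)                      # total characters in both strings

manual_ratio = 2.0 * M / T
print(M, T, manual_ratio)
```

Only the two characters the OCR misread ('o' read as '0', twice) fall outside the matching blocks, so M is 22 out of T = 48.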
There is also get_close_matches, which might be useful to you. You can specify a similarity cutoff (a ratio between 0 and 1) and it returns the best matches from a list that score at least that high (up to n of them, 3 by default), most similar first:
>>> difflib.get_close_matches('unicorn', ['unicycle', 'uncorn', 'corny',
...                                       'house'], cutoff=0.8)
['uncorn']
>>> difflib.get_close_matches('unicorn', ['unicycle', 'uncorn', 'corny',
...                                       'house'], cutoff=0.5)
['uncorn', 'corny', 'unicycle']
Update: to find a partial sub-sequence match
To find close matches to a three-word sequence, I would split the text into words, group them into overlapping three-word sequences, and then apply difflib.get_close_matches, like this:
import difflib

# Adjacent string literals are joined, so the long text can span two lines.
text = ("Here is the text we are trying to match across to find the three word "
        "sequence n0 inf0rmation available I wonder if we will find it?")
words = text.split()
# Build every overlapping run of three consecutive words.
three = [' '.join([i, j, k]) for i, j, k in zip(words, words[1:], words[2:])]
print(difflib.get_close_matches('no information available', three, cutoff=0.9))
# Output:
# ['n0 inf0rmation available']
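The same idea can be wrapped into a yes/no check with the ~85% threshold from the question. This is only a sketch: contains_fuzzy is a hypothetical helper name, not part of difflib, and it assumes the OCR keeps word boundaries intact, so that the phrase shows up as the same number of whitespace-separated words.

```python
import difflib


def contains_fuzzy(text, phrase, threshold=0.85):
    """Return True if some run of words in text is at least `threshold`
    similar to phrase (hypothetical helper, not a difflib function)."""
    words = text.split()
    n = len(phrase.split())  # window size = number of words in the phrase
    if n == 0 or len(words) < n:
        return False

    matcher = difflib.SequenceMatcher(None)
    # SequenceMatcher caches details about its second sequence, so the
    # fixed phrase goes in seq2 and only seq1 changes per window.
    matcher.set_seq2(phrase)

    for start in range(len(words) - n + 1):
        window = ' '.join(words[start:start + n])
        matcher.set_seq1(window)
        if matcher.ratio() >= threshold:
            return True
    return False


text = ("Here is the text we are trying to match across to find the three word "
        "sequence n0 inf0rmation available I wonder if we will find it?")
print(contains_fuzzy(text, 'no information available'))
```

The comparison is case-sensitive as written. If the OCR output varies in case, lowercasing both text and phrase before comparing may help. If the OCR merges or splits words, a character-based sliding window would be needed instead of a word-based one.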