I have an image that contains textual information and:
the problem is : sometimes the ocr extracts a value with some symbols in it so it doesn't match the pattern example : for the pattern date i have :
pattern = "(0[1-9]|[12][0-9]|3[01])/(0[1-9]|1[012])/(19|20)\d\d"
the value from the image is
12/02/2014
but the OCR extracted :
12? /02 -2014
i want to get the similarity between the pattern and the value extracted ( to treat it lately ) is there a way to do that without changing the pattern ?
A specific regex cannot be used to match a pattern with ambiguities without modifications that allow such ambiguities. For example, if you would like to allow an insertion of extra characters in arbitrary spots of the string being matched, the regex pattern would need to have provisions that much these arbitrary characters. This makes the pattern ugly very quickly: for example, while a pattern for matching an int
is very simple,
\\d+
the same pattern that allows non-digits in between would look like this:
(\\d\\D*)+
This gets uglier and uglier as the pattern gets larger, so this approach is not very good.
I would recommend replacing a pattern-based matching with an algorithm that implements a variation of Levenshtein distance.
The original Levenshtein distance algorithm takes two strings, and returns the number of modifications that need to be done on one string in order to get the other one. Your algorithm should take a string and a pattern. The pattern should use some sort of a designator for digits (say, #
) and treat all other characters "literally", as string characters. You would modify the indicator function used in the algorithm to return zero when you send it a #
and any digit, and 1
otherwise.
Take a look at the implementation with two matrix rows, it is the most space-efficient. The indicator function is implemented on this line:
var cost = (s[i] == t[j]) ? 0 : 1;
Changing it to
int cost = (s[i] == t[j] || (Character.isDigit(s[i]) && t[j] == '#')) ? 0 : 1;
would allow you to "match" digits. Your code could also remove all whitespace from the string before doing the match.
You could decide on the quality of the match by checking the Levenshtein distance. A distance of zero shows a perfect match; a distance of one or two is pretty good for short patterns; a distance of five or more is probably unacceptable.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With