I am looking for a way to do a fuzzy match using regular expressions. I'd like to use Perl, but if someone can recommend any way to do this that would be helpful.
As an example, I want to match a string on the words "New York" preceded by a 2-digit number. The difficulty comes because the text is from OCR of a PDF, so I want to do a fuzzy match. I'd like to match:
12 New York 24 Hew York 33 New Yobk
and other "close" matches (in the sense of the Levenshtein distance), but not:
aa New York 11 Detroit
Obviously, I will need to specify the allowable distance ("fuzziness") for the match.
As I understand it, I cannot use the String::Approx
Perl module to do this, because I need to include a regular expression in my match (to match the preceding digits).
Also, I should note that this is a very simplified example of what I'm really trying to match, so I'm not looking for a brute-force approach.
Edited to add:
Okay, my first example was too simple. I didn't mean for people to get hung up on the preceding digits -- sorry about the bad example. Here's a better example. Consider this string:
ASSIGNOR, BY MESHS ASSIGN1IBNTS, TO ALUSCHALME&S MANOTAC/rURINGCOMPANY, A COBPOBATlOH OF DELAY/ABE.
What this actually says is:
ASSIGNOR, BY MESNE ASSIGNMENTS, TO ALLIS-CHALMERS MANUFACTURING COMPANY, A CORPORATION OF DELAWARE
What I need to do is extract the phrase "ALUSCHALME&S MANOTAC/rURINGCOMPANY" and "DELAY/ABE". (I realize this might seem like madness. But I'm an optimist.) In general, the pattern will look something like this:
/Assignor(, by mesne assignments,)? to (company name), a corporation of (state)/i
where the matching is fuzzy.
[] denotes a character class. () denotes a capturing group. [a-z0-9] -- One character that is in the range of a-z OR 0-9. (a-z0-9) -- Explicit capture of a-z0-9 .
Fuzzy matching (FM), also known as fuzzy logic, approximate string matching, fuzzy name matching, or fuzzy string matching is an artificial intelligence and machine learning technology that identifies similar, but not identical elements in data table sets.
For example, if a user types "Misissippi" into Yahoo or Google -- both of which use fuzzy matching -- a list of hits is returned along with the question, "Did you mean Mississippi?" Alternative spellings and words that sound the same but are spelled differently are given.
NLP methods such as Regex matching, fuzzywuzzy, and TF-IDF are popular solutions for entity resolution.
If you have one pattern you want to find the best match against a text collection you can try q-gram distance. It is quite easy to implement and adopt to special needs.
Your second description actually was helpful here, because the pattern and texts should be fairly long. q-gram distance does not work well with words like "York", but if your typical pattern is a whole address, that should be fine.
Try it like this:
q=3
.NEW_YORK
becomes [NEW, EW_, W_Y, _YO, ORK]
, store this away with each text.If you did that you can tweak this algorithm by:
q-1
special chars, so even your short words will get a decent profile. For example New York
becomes ^^NEW YORK$$
.Note that this algorithm in the here described basic form does not have a good running time during search, i.e. O(|T|*|P|)
(with |T|
and |P|
the total lengths of your text and pattern). This is because I described that you loop over all your texts, and then over your pattern. Therefore this is only practical for a medium-sized texts-base. If you spend some thought, you can create an advanced index structure over the q-grams (maybe using hashtables), so this might be practical for huge texts-bases as well.
Regexes have specific rules, they aren't built for doing what you want. It's going to be much easier to make two passes at it. Use a regex to strip off the numbers and then use a module to get your match close.
Something like this (assuming your input is lines from a file)
while( my $line = <$fh> ) { chomp $line; # do we have digits? if( $line =~ /^\d+/ ) { # removes spaces and digits from the beginning of the line $line =~ s/^[\d\s]*//g; # use your module to determine if you have a match in the remaining text. if( module_match ) { # do something } else { #no match } } else { # no match } }
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With