Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fuzzy regular expressions

I am looking for a way to do a fuzzy match using regular expressions. I'd like to use Perl, but if someone can recommend any way to do this that would be helpful.

As an example, I want to match a string on the words "New York" preceded by a 2-digit number. The difficulty comes because the text is from OCR of a PDF, so I want to do a fuzzy match. I'd like to match:

12 New York 24 Hew York 33 New Yobk 

and other "close" matches (in the sense of the Levenshtein distance), but not:

aa New York 11 Detroit 

Obviously, I will need to specify the allowable distance ("fuzziness") for the match.

As I understand it, I cannot use the String::Approx Perl module to do this, because I need to include a regular expression in my match (to match the preceding digits).

Also, I should note that this is a very simplified example of what I'm really trying to match, so I'm not looking for a brute-force approach.

Edited to add:

Okay, my first example was too simple. I didn't mean for people to get hung up on the preceding digits -- sorry about the bad example. Here's a better example. Consider this string:

ASSIGNOR, BY MESHS ASSIGN1IBNTS, TO ALUSCHALME&S MANOTAC/rURINGCOMPANY, A COBPOBATlOH OF DELAY/ABE.

What this actually says is:

ASSIGNOR, BY MESNE ASSIGNMENTS, TO ALLIS-CHALMERS MANUFACTURING COMPANY, A CORPORATION OF DELAWARE

What I need to do is extract the phrase "ALUSCHALME&S MANOTAC/rURINGCOMPANY" and "DELAY/ABE". (I realize this might seem like madness. But I'm an optimist.) In general, the pattern will look something like this:

/Assignor(, by mesne assignments,)? to (company name), a corporation of (state)/i

where the matching is fuzzy.

like image 234
itzy Avatar asked Nov 11 '10 15:11

itzy


People also ask

What is difference [] and () in RegEx?

[] denotes a character class. () denotes a capturing group. [a-z0-9] -- One character that is in the range of a-z OR 0-9. (a-z0-9) -- Explicit capture of a-z0-9 .

What is meant by fuzzy matching?

Fuzzy matching (FM), also known as fuzzy logic, approximate string matching, fuzzy name matching, or fuzzy string matching is an artificial intelligence and machine learning technology that identifies similar, but not identical elements in data table sets.

What is fuzzy search example?

For example, if a user types "Misissippi" into Yahoo or Google -- both of which use fuzzy matching -- a list of hits is returned along with the question, "Did you mean Mississippi?" Alternative spellings and words that sound the same but are spelled differently are given.

Is Fuzzy Wuzzy NLP?

NLP methods such as Regex matching, fuzzywuzzy, and TF-IDF are popular solutions for entity resolution.


2 Answers

If you have one pattern you want to find the best match against a text collection you can try q-gram distance. It is quite easy to implement and adopt to special needs.

Your second description actually was helpful here, because the pattern and texts should be fairly long. q-gram distance does not work well with words like "York", but if your typical pattern is a whole address, that should be fine.

Try it like this:

  • transform your texts and patterns into a reduced character set, like uppercase-only, stripping, wordifying (one space between words) all symbols replaced by "#" or something.
  • choose a q-gram length, to work with. Try 3 or 2. We call this q=3.
  • than, build a qgram-profile of each text:
  • split each text into q-words, ie. NEW_YORK becomes [NEW, EW_, W_Y, _YO, ORK], store this away with each text.
  • if you search for your pattern then, you do the same with your pattern,
  • loop through your text-qgram-database and
    • count for each pattern/text-pair how many qgrams are the same.
    • each hit will raise the score by 1.
  • the texts with the highest score(s) are your best hits.

If you did that you can tweak this algorithm by:

  • prepend all you texts (and also the pattern before search), with q-1 special chars, so even your short words will get a decent profile. For example New York becomes ^^NEW YORK$$.
  • You can even play around with replacing all consonants with "x" and vowels with "o" and so on. Play around with a couple of character classes this way, or even create super symbols by replacing groups of character by one other, i.e. CK becomes K, or SCH becomes $.
  • when raising the score by a qgram-hit you can adjust the value of 1 by other things, like length-difference of text vs pattern.
  • store 2-grams and 3-grams both, and when counting, weigh then differently.

Note that this algorithm in the here described basic form does not have a good running time during search, i.e. O(|T|*|P|) (with |T| and |P| the total lengths of your text and pattern). This is because I described that you loop over all your texts, and then over your pattern. Therefore this is only practical for a medium-sized texts-base. If you spend some thought, you can create an advanced index structure over the q-grams (maybe using hashtables), so this might be practical for huge texts-bases as well.

like image 97
towi Avatar answered Sep 18 '22 06:09

towi


Regexes have specific rules, they aren't built for doing what you want. It's going to be much easier to make two passes at it. Use a regex to strip off the numbers and then use a module to get your match close.

Something like this (assuming your input is lines from a file)

while( my $line = <$fh> ) {     chomp $line;      # do we have digits?     if( $line =~ /^\d+/ ) {          # removes spaces and digits from the beginning of the line          $line =~ s/^[\d\s]*//g;           # use your module to determine if you have a match in the remaining text.          if( module_match ) {              # do something          }          else {              #no match          }     }     else {         # no match     } } 
like image 20
Cfreak Avatar answered Sep 20 '22 06:09

Cfreak