Fuzzy regular expressions

Tags:

I am looking for a way to do a fuzzy match using regular expressions. I'd like to use Perl, but if someone can recommend any way to do this that would be helpful.

As an example, I want to match a string on the words "New York" preceded by a 2-digit number. The difficulty comes because the text is from OCR of a PDF, so I want to do a fuzzy match. I'd like to match:

12 New York 24 Hew York 33 New Yobk

and other "close" matches (in the sense of the Levenshtein distance), but not:

aa New York 11 Detroit

Obviously, I will need to specify the allowable distance ("fuzziness") for the match.

As I understand it, I cannot use the String::Approx Perl module to do this, because I need to include a regular expression in my match (to match the preceding digits).

Also, I should note that this is a very simplified example of what I'm really trying to match, so I'm not looking for a brute-force approach.

Edited to add:

Okay, my first example was too simple. I didn't mean for people to get hung up on the preceding digits -- sorry about the bad example. Here's a better example. Consider this string:

ASSIGNOR, BY MESHS ASSIGN1IBNTS, TO ALUSCHALME&S MANOTAC/rURINGCOMPANY, A COBPOBATlOH OF DELAY/ABE.

What this actually says is:

ASSIGNOR, BY MESNE ASSIGNMENTS, TO ALLIS-CHALMERS MANUFACTURING COMPANY, A CORPORATION OF DELAWARE

What I need to do is extract the phrase "ALUSCHALME&S MANOTAC/rURINGCOMPANY" and "DELAY/ABE". (I realize this might seem like madness. But I'm an optimist.) In general, the pattern will look something like this:

/Assignor(, by mesne assignments,)? to (company name), a corporation of (state)/i

where the matching is fuzzy.

234

asked Nov 11 '10 15:11

itzy

2 Answers

If you have one pattern you want to find the best match against a text collection you can try q-gram distance. It is quite easy to implement and adopt to special needs.

Your second description actually was helpful here, because the pattern and texts should be fairly long. q-gram distance does not work well with words like "York", but if your typical pattern is a whole address, that should be fine.

Try it like this:

transform your texts and patterns into a reduced character set, like uppercase-only, stripping, wordifying (one space between words) all symbols replaced by "#" or something.
choose a q-gram length, to work with. Try 3 or 2. We call this q=3.
than, build a qgram-profile of each text:
split each text into q-words, ie. NEW_YORK becomes [NEW, EW_, W_Y, _YO, ORK], store this away with each text.
if you search for your pattern then, you do the same with your pattern,
loop through your text-qgram-database and
- count for each pattern/text-pair how many qgrams are the same.
- each hit will raise the score by 1.
the texts with the highest score(s) are your best hits.

If you did that you can tweak this algorithm by:

prepend all you texts (and also the pattern before search), with q-1 special chars, so even your short words will get a decent profile. For example New York becomes ^^NEW YORK$$.
You can even play around with replacing all consonants with "x" and vowels with "o" and so on. Play around with a couple of character classes this way, or even create super symbols by replacing groups of character by one other, i.e. CK becomes K, or SCH becomes $.
when raising the score by a qgram-hit you can adjust the value of 1 by other things, like length-difference of text vs pattern.
store 2-grams and 3-grams both, and when counting, weigh then differently.

Note that this algorithm in the here described basic form does not have a good running time during search, i.e. O(|T|*|P|) (with |T| and |P| the total lengths of your text and pattern). This is because I described that you loop over all your texts, and then over your pattern. Therefore this is only practical for a medium-sized texts-base. If you spend some thought, you can create an advanced index structure over the q-grams (maybe using hashtables), so this might be practical for huge texts-bases as well.

answered Sep 18 '22 06:09

towi

Regexes have specific rules, they aren't built for doing what you want. It's going to be much easier to make two passes at it. Use a regex to strip off the numbers and then use a module to get your match close.

Something like this (assuming your input is lines from a file)

while( my $line = <$fh> ) {     chomp $line;      # do we have digits?     if( $line =~ /^\d+/ ) {          # removes spaces and digits from the beginning of the line          $line =~ s/^[\d\s]*//g;           # use your module to determine if you have a match in the remaining text.          if( module_match ) {              # do something          }          else {              #no match          }     }     else {         # no match     } }

answered Sep 20 '22 06:09

Cfreak

Related questions
                            
                                How to determine if a File Matches a File Mask?
                            
                                regex find characters between " "
                            
                                Extract float/double value
                            
                                Regex to accept alphanumeric and some special character in Javascript? [closed]
                            
                                How can I match the start and end in Python's regex?
                            
                                postgres regexp_replace want to allow only a-z and A-Z
                            
                                Regex Expressions for all non alphanumeric symbols
                            
                                Can I substitute multiple items in a single regular expression in VIM or Perl?
                            
                                regex extract email from strings
                            
                                Search and replace in Visual Studio
                            
                                Best way to get all digits from a string [duplicate]
                            
                                Validation for URL/Domain using Regex? (Rails)
                            
                                What does (?!a){0}? mean in a Java regex?
                            
                                Difference between regex_match and regex_search?
                            
                                Why can regular expressions have an exponential running time?
                            
                                Javascript regex to validate IPv4 and IPv6 address, no hostnames
                            
                                regex for n characters or at least m characters
                            
                                Perl Regex "Not" (negative lookahead)
                            
                                What does this regexp mean - "\p{Lu}"?
                            
                                How to escape asterisk in regexp?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Fuzzy regular expressions

Tags:

regex

perl

fuzzy-comparison

itzy

People also ask

2 Answers

towi

Cfreak

Recent Activity

Donate For Us