Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fuzzy Regular Expressions

In my work I have with great results used approximate string matching algorithms such as Damerau–Levenshtein distance to make my code less vulnerable to spelling mistakes.

Now I have a need to match strings against simple regular expressions such TV Schedule for \d\d (Jan|Feb|Mar|...). This means that the string TV Schedule for 10 Jan should return 0 while T Schedule for 10. Jan should return 2.

This could be done by generating all strings in the regex (in this case 100x12) and find the best match, but that doesn't seam practical.

Do you have any ideas how to do this effectively?

like image 674
Thomas Ahle Avatar asked Feb 28 '10 16:02

Thomas Ahle


People also ask

What is meant by fuzzy matching?

Fuzzy Matching (also called Approximate String Matching) is a technique that helps identify two elements of text, strings, or entries that are approximately similar but are not exactly the same.

What is fuzzy search example?

For example, if a user types "Misissippi" into Yahoo or Google -- both of which use fuzzy matching -- a list of hits is returned along with the question, "Did you mean Mississippi?" Alternative spellings and words that sound the same but are spelled differently are given.

Is Fuzzy Wuzzy NLP?

FuzzyWuzzy Python Library: Interesting Tool for NLP and Text Analytics.

What is Fuzzy Wuzzy in Python?

Fuzzywuzzy is a python library that uses Levenshtein Distance to calculate the differences between sequences and patterns that was developed and also open-sourced by SeatGeek, a service that finds event tickets from all over the internet and showcase them on one platform.


1 Answers

I found the TRE library, which seems to be able to do exactly fuzzy matching of regular expressions. Example: http://hackerboss.com/approximate-regex-matching-in-python/ It only supports insertion, deletion and substitution though. No transposition. But I guess that works ok.

I tried the accompanying agrep tool with the regexp on the following file:

TV Schedule for 10Jan TVSchedule for Jan 10 T Schedule for 10 Jan 2010 TV Schedule for 10 March Tv plan for March 

and got

$ agrep -s -E 100 '^TV Schedule for \d\d (Jan|Feb|Mar)$' filename 1:TV Schedule for 10Jan 8:TVSchedule for Jan 10 7:T Schedule for 10 Jan 2010 3:TV Schedule for 10 March 15:Tv plan for March 

Thanks a lot for all your suggestions.

like image 77
Thomas Ahle Avatar answered Sep 23 '22 18:09

Thomas Ahle