
Checking whether a fuzzy/approximate substring exists in a longer string, in Python?

Using algorithms like Levenshtein distance (a Levenshtein module or difflib), it is easy to find approximate matches, e.g.

>>> import difflib
>>> difflib.SequenceMatcher(None, "amazing", "amaging").ratio()
0.8571428571428571

The fuzzy matches can be detected by deciding a threshold as needed.
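For reference, the Levenshtein distance mentioned above can be sketched in pure Python as follows. This is only a minimal illustration: the names levenshtein and is_fuzzy_match and the max_distance parameter are made up for this example and do not come from any library.

```python
def levenshtein(a, b):
    """Return the edit distance between strings a and b.

    Counts the minimum number of single-character insertions,
    deletions and substitutions needed to turn a into b.
    """
    # previous[j] is the distance between the prefix of a processed
    # so far and b[:j]; it starts as the distance from "" to b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca from a
                current[j - 1] + 1,      # insert cb into a
                previous[j - 1] + cost,  # substitute (or keep) ca
            ))
        previous = current
    return previous[-1]


def is_fuzzy_match(a, b, max_distance=1):
    """Hypothetical helper: True if a and b are within max_distance edits."""
    return levenshtein(a, b) <= max_distance
```

Here the threshold is an absolute number of edits rather than a ratio; which of the two is more convenient depends on how much the string lengths vary.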

Current requirement: to find fuzzy substrings of a bigger string, based on a threshold.

e.g.

large_string = "thelargemanhatanproject is a great project in themanhattincity"
query_string = "manhattan"
# result = "manhatan", "manhattin" and their indexes in large_string

One brute-force solution is to generate all substrings of length N-1 to N+1 (or another suitable range of lengths), where N is the length of query_string, run Levenshtein on them one by one, and check each result against the threshold.
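The brute-force idea just described could look roughly like the sketch below. It uses difflib.SequenceMatcher from the standard library in place of a Levenshtein module, so the threshold is a similarity ratio between 0 and 1. The function name fuzzy_substrings and the parameters threshold and slack are assumptions made for this illustration, and the code assumes Python 3.

```python
import difflib


def fuzzy_substrings(query, text, threshold=0.85, slack=1):
    """Return (start, substring, ratio) for every window of text whose
    similarity to query is at least threshold, best matches first.

    Windows of length len(query) - slack .. len(query) + slack are tried
    at every start position, so the cost grows with len(text) * slack.
    """
    n = len(query)
    hits = []
    for length in range(max(1, n - slack), n + slack + 1):
        for start in range(len(text) - length + 1):
            candidate = text[start:start + length]
            ratio = difflib.SequenceMatcher(None, query, candidate).ratio()
            if ratio >= threshold:
                hits.append((start, candidate, ratio))
    # Highest ratio first; ties are ordered by position in text.
    hits.sort(key=lambda hit: (-hit[2], hit[0]))
    return hits


large_string = "thelargemanhatanproject is a great project in themanhattincity"
query_string = "manhattan"

for start, candidate, ratio in fuzzy_substrings(query_string, large_string):
    print(start, repr(candidate), round(ratio, 3))
```

Neighbouring windows overlap, so each real occurrence usually shows up several times with slightly different boundaries; keeping only the best-scoring window per overlapping group is left out here for brevity.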

Is there a better solution available in Python, preferably a module included with Python 2.7, or an externally available module?

---------------------UPDATE AND SOLUTION ----------------

The Python regex module works pretty well, though it is a little slower than the built-in re module for fuzzy substring cases, which is to be expected given the extra work involved. The output is as desired, and the degree of fuzziness is easy to control.

>>> import regex
>>> input = "Monalisa was painted by Leonrdo da Vinchi"
>>> regex.search(r'\b(leonardo){e<3}\s+(da)\s+(vinci){e<2}\b', input, flags=regex.IGNORECASE)
<regex.Match object; span=(23, 41), match=' Leonrdo da Vinchi', fuzzy_counts=(0, 2, 1)>
DhruvPathak asked Jul 19 '13 07:07




1 Answer

The regex library, a third-party module once proposed as a replacement for re, includes fuzzy matching.

https://pypi.python.org/pypi/regex/

The fuzzy matching syntax looks fairly expressive; the example below gives you a match with at most one error (insertion, deletion, or substitution).

import regex
regex.match('(amazing){e<=1}', 'amaging')
mgbelisle answered Sep 20 '22 13:09