Using algorithms like Levenshtein distance (the Levenshtein module or difflib), it is easy to find approximate matches, e.g.:
>>> import difflib
>>> difflib.SequenceMatcher(None, "amazing", "amaging").ratio()
0.8571428571428571
Fuzzy matches can then be detected by choosing a suitable threshold.
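As a minimal sketch of the threshold idea, the ratio can be wrapped in a small helper. The name `is_fuzzy_match` and the default cutoff of 0.8 are assumptions chosen for illustration, not values from any library.

```python
import difflib


def is_fuzzy_match(a, b, threshold=0.8):
    """Return True when the difflib similarity ratio of a and b reaches the threshold.

    Both the helper name and the 0.8 default are illustrative assumptions.
    """
    return difflib.SequenceMatcher(None, a, b).ratio() >= threshold


print(is_fuzzy_match("amazing", "amaging"))  # ratio is about 0.857, above the cutoff
print(is_fuzzy_match("amazing", "boring"))   # ratio is well below the cutoff
```

Raising the threshold makes the match stricter; lowering it admits more distant strings.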
Current requirement: find fuzzy substrings of a bigger string, based on a threshold.
e.g.:
large_string = "thelargemanhatanproject is a great project in themanhattincity"
query_string = "manhattan"
# result = "manhatan", "manhattin" and their indexes in large_string
One brute-force solution is to generate all substrings of length N-1 to N+1 (or another matching length), where N is the length of query_string, run Levenshtein on them one by one, and check each score against the threshold.
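For reference, the brute-force idea might look roughly like this. It is only a sketch: it uses `difflib.SequenceMatcher` as the scorer instead of a Levenshtein implementation, and the function name `fuzzy_substrings` and the 0.8 threshold are assumptions for illustration.

```python
import difflib


def fuzzy_substrings(large_string, query_string, threshold=0.8):
    """Score every window of length N-1..N+1 against the query (brute force).

    Returns (start_index, substring, score) for each window whose score
    reaches the threshold. Overlapping windows around a good match will
    usually be reported as well.
    """
    n = len(query_string)
    hits = []
    for length in (n - 1, n, n + 1):
        if length <= 0:
            continue
        for start in range(len(large_string) - length + 1):
            candidate = large_string[start:start + length]
            score = difflib.SequenceMatcher(None, query_string, candidate).ratio()
            if score >= threshold:
                hits.append((start, candidate, score))
    return hits


large_string = "thelargemanhatanproject is a great project in themanhattincity"
query_string = "manhattan"
for start, candidate, score in fuzzy_substrings(large_string, query_string):
    print(start, candidate, round(score, 3))
```

This does about 3 * len(large_string) comparisons, each of them costly, which is why it scales poorly on long inputs. It also needs a post-processing step to keep only the best window among overlapping hits.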
Is there a better solution available in Python, preferably a module included in Python 2.7, or an externally available module?
---------------------UPDATE AND SOLUTION ----------------
The Python regex module works pretty well, though it is a little slower than the built-in re module for fuzzy substring cases, which is expected given the extra operations. The output is as desired, and the magnitude of fuzziness is easy to control.
>>> import regex
>>> input = "Monalisa was painted by Leonrdo da Vinchi"
>>> regex.search(r'\b(leonardo){e<3}\s+(da)\s+(vinci){e<2}\b',input,flags=regex.IGNORECASE)
<regex.Match object; span=(23, 41), match=' Leonrdo da Vinchi', fuzzy_counts=(0, 2, 1)>
One of the most effective ways to calculate scores for a fuzzy string matching algorithm is by using cosine similarity. The cosine similarity between two non-zero vectors is simply the cosine of the angle between these vectors.
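To make the cosine idea concrete, here is a small sketch. How the strings are turned into vectors is not stated above, so the character-bigram counts used here are an assumption (TF-IDF weighted n-grams are another common choice), and the helper names are hypothetical.

```python
import math
from collections import Counter


def char_ngrams(text, n=2):
    """Count overlapping character n-grams (assumed vectorization, see lead-in)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine_similarity(a, b, n=2):
    """Cosine of the angle between the n-gram count vectors of a and b."""
    va, vb = char_ngrams(a, n), char_ngrams(b, n)
    # Dot product over the n-grams of a; a Counter returns 0 for missing keys.
    dot = sum(count * vb[gram] for gram, count in va.items())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # cosine is undefined for a zero vector; treat as no similarity
    return dot / (norm_a * norm_b)


print(cosine_similarity("manhattan", "manhatan"))
print(cosine_similarity("manhattan", "project"))
```

Identical strings score 1.0, and strings that share no bigrams score 0.0.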
FuzzyWuzzy is a Python library that uses Levenshtein distance to calculate the differences between sequences and patterns. It was developed and open-sourced by SeatGeek, a service that finds event tickets from all over the internet and showcases them on one platform.
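For readers unfamiliar with the metric itself, a plain dynamic-programming version of Levenshtein distance can be sketched as follows. This is not FuzzyWuzzy's code or API, just an illustrative standalone function with a hypothetical name.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or
    substitutions needed to turn a into b (row-by-row dynamic programming)."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca with cb, or keep it
            ))
        previous = current
    return previous[-1]


print(levenshtein("manhattan", "manhatan"))   # one deletion
print(levenshtein("manhattan", "manhattin"))  # one substitution
```

A distance can be turned into a threshold test by comparing it with a maximum number of allowed edits, or by normalizing it by the length of the longer string.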
The third-party regex library, which was intended as an eventual replacement for re, includes fuzzy matching.
https://pypi.python.org/pypi/regex/
The fuzzy matching syntax looks fairly expressive; for example, this would give you a match with one or fewer insertions/substitutions/deletions.
import regex
regex.match('(amazing){e<=1}', 'amaging')