Checking whether a fuzzy/approximate substring exists in a longer string, in Python?

Using algorithms like Levenshtein distance (via the Levenshtein module or difflib), it is easy to find approximate matches, e.g.:

>>> import difflib
>>> difflib.SequenceMatcher(None, "amazing", "amaging").ratio()
0.8571428571428571

Fuzzy matches can then be detected by choosing a similarity threshold as needed.
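As a minimal sketch of that threshold idea, assuming difflib's similarity ratio is the measure being used: the helper name `is_fuzzy_match` and the default threshold of 0.8 are illustrative choices, not something from the question.

```python
import difflib


def is_fuzzy_match(a, b, threshold=0.8):
    """Return True when the difflib similarity ratio of a and b reaches threshold.

    `threshold` is an illustrative default; tune it for your data.
    """
    return difflib.SequenceMatcher(None, a, b).ratio() >= threshold


# "amazing" vs "amaging" has a ratio of about 0.857 (see the session above).
print(is_fuzzy_match("amazing", "amaging"))       # passes the 0.8 default
print(is_fuzzy_match("amazing", "amaging", 0.9))  # fails a stricter threshold
```

Raising the threshold makes the check stricter; lowering it admits looser matches.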

Current requirement: find fuzzy substrings of a bigger string, based on a threshold.

e.g.

large_string = "thelargemanhatanproject is a great project in themanhattincity"
query_string = "manhattan"
# result = "manhatan", "manhattin" and their indexes in large_string

One brute-force solution is to generate all substrings of length N-1 to N+1 (or other suitable lengths), where N is the length of query_string, run Levenshtein on each one, and check the result against the threshold.
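The brute-force approach described above might look roughly like the sketch below. It uses difflib's ratio as a stand-in for a true Levenshtein distance so that it stays within the standard library; the function name `fuzzy_substrings`, the `slack` parameter, and the 0.85 threshold are hypothetical choices for illustration.

```python
import difflib


def fuzzy_substrings(query, text, threshold=0.8, slack=1):
    """Brute force: score every window of length N-slack..N+slack against query.

    Returns a list of (start_index, substring, ratio) for windows whose
    difflib ratio is at least `threshold`. Overlapping windows are not merged.
    """
    n = len(query)
    hits = []
    for length in range(max(1, n - slack), n + slack + 1):
        for start in range(len(text) - length + 1):
            window = text[start:start + length]
            ratio = difflib.SequenceMatcher(None, query, window).ratio()
            if ratio >= threshold:
                hits.append((start, window, ratio))
    return hits


large_string = "thelargemanhatanproject is a great project in themanhattincity"
query_string = "manhattan"

# Print the hits, best-scoring first.
hits = fuzzy_substrings(query_string, large_string, threshold=0.85)
for start, window, ratio in sorted(hits, key=lambda h: -h[2]):
    print(start, window, round(ratio, 3))
```

Because neighbouring windows overlap, near-duplicates of each hit (for example a window shifted by one character) also pass the threshold; keeping only the best-scoring window per region would be a natural refinement. The cost grows with the text length times the number of window sizes, which is why the question asks for something better.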

Is there a better solution available in Python, preferably a module included in Python 2.7, or an externally available module?

---------------------UPDATE AND SOLUTION ----------------

The Python regex module works pretty well, though it is a little slower than the built-in re module for fuzzy substring cases, which is expected given the extra work fuzzy matching requires. The output is what I wanted, and the degree of fuzziness is easy to control.

>>> import regex
>>> input = "Monalisa was painted by Leonrdo da Vinchi"
>>> regex.search(r'\b(leonardo){e<3}\s+(da)\s+(vinci){e<2}\b', input, flags=regex.IGNORECASE)
<regex.Match object; span=(23, 41), match=' Leonrdo da Vinchi', fuzzy_counts=(0, 2, 1)>

306

asked Jul 19 '13 07:07

DhruvPathak

1 Answer

The newer regex library, which was proposed as a replacement for re, includes fuzzy matching.

https://pypi.python.org/pypi/regex/

The fuzzy matching syntax looks fairly expressive; for example, this gives you a match with at most one insertion, substitution, or deletion.

import regex
regex.match('(amazing){e<=1}', 'amaging')
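To make the error constraints a little more concrete, here is a small sketch, assuming the third-party regex package is installed (`pip install regex`). It compares the general `{e<=1}` constraint with the type-specific `{s<=1}` (substitutions only) and `{i<=1}` (insertions only) forms, and uses `fullmatch` rather than `match` so that the whole string has to be consumed.

```python
import regex  # third-party package, not the standard-library re module

# {e<=1}: at most one error of any kind (substitution, insertion, or deletion).
any_error = regex.fullmatch(r'(?:amazing){e<=1}', 'amaging')
# fuzzy_counts is (substitutions, insertions, deletions) used by the match.
print(any_error.fuzzy_counts)

# {s<=1}: at most one substitution, no other error types.
only_sub = regex.fullmatch(r'(?:amazing){s<=1}', 'amaging')
# {i<=1}: at most one insertion, no other error types.
only_ins = regex.fullmatch(r'(?:amazing){i<=1}', 'amaging')

print(only_sub is not None)  # 'z' -> 'g' is a substitution
print(only_ins is not None)  # an insertion alone cannot turn 'amazing' into 'amaging'
```

Restricting the error type this way is useful when, say, dropped characters are acceptable but swapped ones are not.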

171

answered Sep 20 '22 13:09

mgbelisle

Tags: python, python-2.7, fuzzy-search
