How do I fuzzy match a word to a full word (and only a full word) in a sentence?

Tags: python, regex, fuzzy-search

Most commonly misspelled English words are within two or three typographic errors (a combination of substitutions s, insertions i, or deletions d) of their correct form. For example, the errors in the word pair absence - absense can be summarized as 1 s, 0 i and 0 d.

One can fuzzy match to find words and their misspellings using the third-party regex Python module (intended as a replacement for re).

The following table summarizes attempts made to fuzzy segment a word of interest from some sentence:

[Image: table of the matches returned by Regex1 through Regex4 for each word-sentence pair]

Regex1, (?b)(?:WORD){e<=2}, finds the best word match in the sentence allowing at most 2 errors.
Regex2, (?b)(?:\wWORD\W){e<=2}, finds the best word match in the sentence allowing at most 2 errors while trying to operate only on (I think) whole words.
Regex3, (?V1)(?b)(?:\w&&WORD){e<=2}, finds the best word match in the sentence allowing at most 2 errors while operating only on (I think) whole words. I'm wrong somehow.
Regex4, (?V1)(?b)(?:WORD&&\W){e<=2}, finds the best word match in the sentence allowing at most 2 errors while (I think) requiring the end of the match to be a word boundary.

How would I write a regex that eliminates, if possible, false-positive and false-negative fuzzy matches on these word-sentence pairs?

A possible solution would be to compare the word of interest (the principal word) only against whole words in the sentence, where a word is a string of characters surrounded by whitespace or the beginning/end of a line. If there's a fuzzy match (e<=2) between the principal word and a word in the sentence, then return that full word (and only that word) from the sentence.
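To make that idea concrete, here is a minimal sketch that uses only the standard library instead of the regex module. It assumes that "errors" means plain Levenshtein distance and that a candidate may span several consecutive whitespace-separated words, so that a two-word principal word such as cub cadet can still match. The names edit_distance and best_whole_word_match are hypothetical helpers, not part of any library.

```python
def edit_distance(a, b):
    """Levenshtein distance: substitutions, insertions and deletions cost 1 each."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


def best_whole_word_match(word, sentence, max_errors=2):
    """Return the run of whole words closest to `word`, or None.

    Assumption: words are whitespace-delimited, and runs of consecutive
    words are joined with single spaces before being compared.
    """
    tokens = sentence.split()
    best = None  # (distance, candidate)
    for start in range(len(tokens)):
        for end in range(start + 1, len(tokens) + 1):
            candidate = ' '.join(tokens[start:end])
            # Longer runs from this start can only be further away.
            if len(candidate) > len(word) + max_errors:
                break
            distance = edit_distance(word, candidate)
            if distance <= max_errors and (best is None or distance < best[0]):
                best = (distance, candidate)
    return best[1] if best else None


pairs = [
    ('cub cadet', 'cub cadet 42'),
    ('plastex', 'vinyl panels'),
    ('ryobi', 'batteries kyobi'),
    ('trafficmaster', 'traffic mast5er'),
]
for word, sentence in pairs:
    print(word, '->', best_whole_word_match(word, sentence))
```

Because candidates are built only from whole tokens, a match can never start or stop in the middle of a word, which is the property the regex attempts are trying to enforce.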

Code

Copy the following dataframe to your clipboard:

            word                  sentence
0      cub cadet              cub cadet 42
1        plastex              vinyl panels
2            spt  heat and air conditioner
3     closetmaid                closetmaid
4          ryobi           batteries kyobi
5          ryobi       10' table saw ryobi
6  trafficmaster           traffic mast5er

Now load the table into your environment and apply the four patterns:

import pandas as pd, regex

# Columns are separated by two or more whitespace characters
df = pd.read_clipboard(sep=r'\s\s+')

test = df
# Each new column is named after its pattern, with WORD standing in for the value of df['word']
test[r'(?b)(?:WORD){e<=2}'] = df.apply(lambda x: regex.findall(r'(?b)(?:' + x['word'] + r'){e<=2}', x['sentence']), axis=1)
test[r'(?b)(?:\wWORD\W){e<=2}'] = df.apply(lambda x: regex.findall(r'(?b)(?:\w' + x['word'] + r'\W){e<=2}', x['sentence']), axis=1)
test[r'(?V1)(?b)(?:\w&&WORD){e<=2}'] = df.apply(lambda x: regex.findall(r'(?V1)(?b)(?:\w&&' + x['word'] + r'){e<=2}', x['sentence']), axis=1)
test[r'(?V1)(?b)(?:WORD&&\W){e<=2}'] = df.apply(lambda x: regex.findall(r'(?V1)(?b)(?:' + x['word'] + r'&&\W){e<=2}', x['sentence']), axis=1)

asked Apr 25 '16 23:04

zelusp

1 Answer

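Use '(?b)\m(?:WORD){e<=2}\M'. In the regex module, \m matches only at the start of a word and \M only at the end of a word, so the fuzzy match has to begin and end on word boundaries. (?b) turns on BESTMATCH, which looks for the best match rather than the next one.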
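As a rough sketch of how that pattern might be applied to a single word-sentence pair: fuzzy_whole_word is a hypothetical helper name, and escaping the word with regex.escape is an added safeguard (an assumption, not part of the pattern above) so that any regex metacharacters in the word are treated literally. It requires the third-party regex module.

```python
import regex  # third-party module: pip install regex


def fuzzy_whole_word(word, sentence, max_errors=2):
    """Return the best whole-word fuzzy match for `word`, or None.

    Hypothetical helper built around the pattern (?b)\\m(?:WORD){e<=N}\\M.
    """
    pattern = (
        r'(?b)\m(?:'             # BESTMATCH flag, then start-of-word anchor
        + regex.escape(word)     # assumption: treat the word literally
        + r'){e<=' + str(max_errors) + r'}'  # allow up to max_errors errors
        + r'\M'                  # end-of-word anchor
    )
    match = regex.search(pattern, sentence)
    return match.group(0) if match else None


for word, sentence in [('ryobi', 'batteries kyobi'),
                       ('closetmaid', 'closetmaid'),
                       ('plastex', 'vinyl panels')]:
    print(word, '->', fuzzy_whole_word(word, sentence))
```

To run this over the dataframe from the question, the same function could be passed to df.apply with axis=1, in the same way as the four attempts there.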

answered Oct 11 '22 23:10

zelusp
