Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How do I fuzzy match word to a full word (and only full word) in a sentence?

Most commonly misspelled English words are within two or three typographic errors (a combination of substitutions s, insertions i, or letter deletions d) from their correct form. I.e. errors in the word pair absence - absense can be summarized as having 1 s, 0 i and 0 d.

One can fuzzy match to find words and their misspellings using the to-replace-re regex python module.

The following table summarizes attempts made to fuzzy segment a word of interest from some sentence:

enter image description here

  • Regex1 finds the best word match in sentence allowing at most 2 errors
  • Regex2 finds the best word match in sentence allowing at most 2 errors while trying to operate only on (I think) whole words
  • Regex3 finds the best word match in sentence allowing at most 2 errors while operating only on (I think) whole words. I'm wrong somehow.
  • Regex4 finds the best word match in sentence allowing at most 2 errors while (I think) looking for the end of the match to be a word boundary

How would I write a regex expression that eliminates, if possible, false positive and false negative fuzzy matches on these word-sentence pairs?

A possible solution would be to only compare words (strings of characters surrounded by white space or the beginning/end of a line) in the sentence to the word of interest (principal word). If there's a fuzzy match (e<=2) between the principal word and a word in the sentence, then return that full word (and only that word) from the sentence.

Code

Copy the following dataframe to your clipboard:

            word                  sentence
0      cub cadet              cub cadet 42
1        plastex              vinyl panels
2            spt  heat and air conditioner
3     closetmaid                closetmaid
4          ryobi           batteries kyobi
5          ryobi       10' table saw ryobi
6  trafficmaster           traffic mast5er

Now use

import pandas as pd, regex
df=pd.read_clipboard(sep='\s\s+')

test=df
test['(?b)(?:WORD){e<=2}']=df.apply(lambda x: regex.findall(r'(?b)(?:'+x['word']+'){e<=2}', x['sentence']),axis=1)
test['(?b)(?:\wWORD\W){e<=2}']=df.apply(lambda x: regex.findall(r'(?b)(?:\w'+x['word']+'\W){e<=2}', x['sentence']),axis=1)
test['(?V1)(?b)(?:\w&&WORD){e<=2}']=df.apply(lambda x: regex.findall(r'(?V1)(?b)(?:\w&&'+x['word']+'){e<=2}', x['sentence']),axis=1)
test['(?V1)(?b)(?:WORD&&\W){e<=2}']=df.apply(lambda x: regex.findall(r'(?V1)(?b)(?:'+x['word']+'&&\W){e<=2}', x['sentence']),axis=1)

To load the table into your environment.

like image 983
zelusp Avatar asked Apr 25 '16 23:04

zelusp


1 Answers

Do '(?b)\m(?:WORD){e<=2}\M'

like image 191
zelusp Avatar answered Oct 11 '22 23:10

zelusp