Most commonly misspelled English words are within two or three typographic errors (a combination of substitutions s, insertions i, or letter deletions d) from their correct form. I.e. errors in the word pair absence - absense
can be summarized as having 1 s, 0 i and 0 d.
One can fuzzy match to find words and their misspellings using the to-replace-re regex python module.
The following table summarizes attempts made to fuzzy segment a word of interest from some sentence:
word
match in sentence
allowing at most 2
errorsword
match in sentence
allowing at
most 2 errors while trying to operate only on (I think) whole wordsword
match in sentence
allowing at
most 2 errors while operating only on (I think) whole words. I'm wrong somehow.word
match in sentence
allowing at
most 2 errors while (I think) looking for the end of the match to be a word boundaryHow would I write a regex expression that eliminates, if possible, false positive and false negative fuzzy matches on these word-sentence pairs?
A possible solution would be to only compare words (strings of characters surrounded by white space or the beginning/end of a line) in the sentence to the word of interest (principal word). If there's a fuzzy match (e<=2) between the principal word and a word in the sentence, then return that full word (and only that word) from the sentence.
Copy the following dataframe to your clipboard:
word sentence
0 cub cadet cub cadet 42
1 plastex vinyl panels
2 spt heat and air conditioner
3 closetmaid closetmaid
4 ryobi batteries kyobi
5 ryobi 10' table saw ryobi
6 trafficmaster traffic mast5er
Now use
import pandas as pd, regex
df=pd.read_clipboard(sep='\s\s+')
test=df
test['(?b)(?:WORD){e<=2}']=df.apply(lambda x: regex.findall(r'(?b)(?:'+x['word']+'){e<=2}', x['sentence']),axis=1)
test['(?b)(?:\wWORD\W){e<=2}']=df.apply(lambda x: regex.findall(r'(?b)(?:\w'+x['word']+'\W){e<=2}', x['sentence']),axis=1)
test['(?V1)(?b)(?:\w&&WORD){e<=2}']=df.apply(lambda x: regex.findall(r'(?V1)(?b)(?:\w&&'+x['word']+'){e<=2}', x['sentence']),axis=1)
test['(?V1)(?b)(?:WORD&&\W){e<=2}']=df.apply(lambda x: regex.findall(r'(?V1)(?b)(?:'+x['word']+'&&\W){e<=2}', x['sentence']),axis=1)
To load the table into your environment.
Do '(?b)\m(?:WORD){e<=2}\M'
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With