
How to replace misspelled words in a pandas dataframe

I have 2 pandas DataFrames. One containing a list of properly spelled words:

[In]: df1
[Out]:
   words
0  apple
1  phone
2  clock
3  table
4  clean

and one with misspelled words:

[In]: df2
[Out]:
   misspelled
0        aple
1         phn
2        alok
3     garbage
4        appl
5         pho

The goal is to replace the column of misspelled words in the second DataFrame using the list of correctly spelled words from the first DataFrame. The second DataFrame can contain repeated words, can be a different size than the first, and can contain words that aren't in the first DataFrame (or aren't similar enough to match).

I've been trying to use difflib.get_close_matches with some success, but it does not work out perfectly.

This is what I have so far:

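import numpy as np
import pandas as pd
from difflib import get_close_matches

# Look up close matches for each misspelled word, then join each result list into one string
x = list(map(lambda w: get_close_matches(w, df1.words), df2.misspelled))
good_words = list(map(''.join, x))
l = np.array(good_words, dtype='object')
df2.misspelled = pd.Series(l)
df2 = df2[df2.misspelled != '']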

After applying the transformation, I should get the second DataFrame to look like:

[In]: df2
[Out]:
          0
0     apple
1     phone
2     clock
3       NaN
4     apple
5     phone

If no match is found the row gets replaced with NaN. My problem is that I get a result that looks like this:

[In]: df2
[Out]:
    misspelled
0        apple
1        phone
2   clockclean
3          NaN
4        apple
5        phone

At the time of writing I have not figured out why some of the words are combined. I suspect it has something to do with difflib.get_close_matches matching different words that are similar in length and/or lettering. So far, around 10% to 15% of the words in a whole column get combined like this. Thanks in advance.

Stealing asked Mar 04 '23 21:03


1 Answer

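difflib.get_close_matches returns a list of up to n matches (n defaults to 3), sorted from best to worst, so ''.join glues all of them together whenever more than one word passes the cutoff. To keep only the first (best) match, use next with iter, which also lets you supply a fallback value (here np.nan) when the list is empty. The cutoff parameter (default 0.6) can be adjusted to your desired threshold: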

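import difflib
import numpy as np

# next(iter(m), np.nan) takes the best match, or NaN if the list is empty
x = [next(iter(m), np.nan)
     for m in map(lambda w: difflib.get_close_matches(w, df1.words, cutoff=0.6), df2.misspelled)]
df2['col1'] = x

print(df2)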
  misspelled   col1
0       aple  apple
1        phn  phone
2       alok  clock
3    garbage    NaN
4       appl  apple
5        pho  phone
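
For anyone who wants to run this end to end, here is a minimal self-contained sketch using the sample data from the question. The names best_match, vocabulary and corrected are only illustrative and are not part of the original code. It first shows the list that get_close_matches hands back for a single word (the thing ''.join was gluing together), then asks for a single candidate with n=1 as an alternative to next/iter:

```python
import difflib

import numpy as np
import pandas as pd

# Sample data copied from the question
df1 = pd.DataFrame({"words": ["apple", "phone", "clock", "table", "clean"]})
df2 = pd.DataFrame({"misspelled": ["aple", "phn", "alok", "garbage", "appl", "pho"]})

vocabulary = df1["words"].tolist()

# get_close_matches returns a list with up to n candidates (n defaults to 3),
# best match first, so joining the list concatenates every candidate.
candidates = difflib.get_close_matches("aple", vocabulary)
joined = "".join(candidates)


def best_match(word, choices, cutoff=0.6):
    """Return the single closest choice, or NaN when nothing reaches the cutoff.

    Hypothetical helper for illustration; n=1 limits the result to one candidate.
    """
    found = difflib.get_close_matches(word, choices, n=1, cutoff=cutoff)
    return found[0] if found else np.nan


# Apply the helper to every misspelled word and store the result in a new column
df2["corrected"] = df2["misspelled"].map(lambda w: best_match(w, vocabulary))

print(candidates, joined)
print(df2)
```

Writing to a new column rather than overwriting misspelled keeps the original values available for checking the matches. Raising cutoff makes matching stricter and produces more NaN rows; lowering it accepts weaker matches.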

jezrael answered Mar 12 '23 00:03