Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Group by fuzzy string matches with fuzzywuzzy and groupby

I have a dataset of random words and names and I am trying to group all of the similar words and names. So given the dataframe below:

     Name           ID            Value
0    James           1             10
1    James 2         2             142
2    Bike            3             1
3    Bicycle         4             1197
4    James Marsh     5             12
5    Ants            6             54
6    Job             7             6
7    Michael         8             80007  
8    Arm             9             47 
9    Mike K          10            9
10   Michael k       11            1

My pseudo code would be something like:

import pandas as pd
from fuzzywuzzy import fuzz

minratio = 95
for idx1, name1 in df['Name'].iteritems():
   for idx2, name2 in df['Name'].iteritems():
      ratio = fuzz.WRatio(name1, name2)
      if ratio > minratio:
          grouped = df.groupby(['Name', 'ID'])['Value']\
                        .agg(Total_Value='sum', Group_Size='count')

This would then give me the desired output:

print(grouped)
     Name           ID            Total_Value          Group_Size
0    James           1             164                     3 # All James' grouped
2    Bike            3             1198                    2 # Bike's and Bicycles grouped
5    Ants            6             54                      1 
6    Job             7             6                       1
7    Michael         8             80017                   3 # Mike's and Michael's grouped
8    Arm             9             47                      1

Obviously this doesn't work, and honestly, I am not sure if this is even possible, but this is what I'm trying to accomplish. Any advice that could get me on the right track would be useful.

like image 275
DrakeMurdoch Avatar asked Feb 02 '26 22:02

DrakeMurdoch


1 Answers

Using affinity propagation clustering (not perfect but maybe a starting point):

import pandas as pd
import numpy as np
import io
from fuzzywuzzy import fuzz
from scipy import spatial
import sklearn.cluster

s="""Name           ID            Value
0    James           1             10
1    James 2         2             142
2    Bike            3             1
3    Bicycle         4             1197
4    James Marsh     5             12
5    Ants            6             54
6    Job             7             6
7    Michael         8             80007  
8    Arm             9             47 
9    Mike K          10            9
10   Michael k       11            1"""
df = pd.read_csv(io.StringIO(s),sep='\s\s+',engine='python')

names = df.Name.values
sim = spatial.distance.pdist(names.reshape((-1,1)), lambda x,y: fuzz.WRatio(x,y))
affprop = sklearn.cluster.AffinityPropagation(affinity="precomputed", random_state=None)
affprop.fit(spatial.distance.squareform(sim))

res = df.groupby(affprop.labels_).agg(
        Names=('Name',','.join),
        First_ID=('ID','first'),
        Total_Value=('Value','sum'),
        Group_Size=('Value','count')
        )

Result

                                Names  First_ID  Total_Value  Group_Size
0  James,James 2,James Marsh,Ants,Arm         1          265           5
1                        Bike,Bicycle         3         1198           2
2                                 Job         7            6           1
3            Michael,Mike K,Michael k         8        80017           3
like image 116
Stef Avatar answered Feb 05 '26 10:02

Stef