Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Clustering in DataFrame using Python

I have a dataset

Name    System
A       AZ
A       NaN
B       AZ
B       NaN
B       NaN
C       AY
C       AY
D       AZ
E       AY
E       AY
E       NaN
F       AZ
F       AZ
F       NaN

Using this dataset, I need to cluster the dataset based on the number of times "System" is repeated for a particular "Name".

In the above example, Names A, B and D have one "AZ" "Subset" while C, E have two "AY" subsets and F has two AZ so it is a different cluster. We can ignore NaN.

Output Example:

Cluster     Names
AZ          A,B
AY,AY       C,E
AZ,AZ       F 

How can I do it using Python?

PS. Actual dataset may vary in number of rows and columns Also, how can I do it using ML based classification algorithms like KNN, Naive Bayes, etc?

like image 856
spd Avatar asked Nov 24 '25 20:11

spd


1 Answers

Use groupby + agg twice; once to join "Systems" and then to join "Names":

s = df.dropna().groupby('Name').agg(', '.join)['System']
s = pd.Series(s.index, index=s)
out = s.groupby(level=0).agg(', '.join).reset_index().rename(columns={'System':'Cluster'})

Output:

  Cluster     Name
0  AY, AY     C, E
1      AZ  A, B, D
2  AZ, AZ        F