I have a dataset
Name System
A AZ
A NaN
B AZ
B NaN
B NaN
C AY
C AY
D AZ
E AY
E AY
E NaN
F AZ
F AZ
F NaN
Using this dataset, I need to cluster the dataset based on the number of times "System" is repeated for a particular "Name".
In the above example, Names A, B and D have one "AZ" "Subset" while C, E have two "AY" subsets and F has two AZ so it is a different cluster. We can ignore NaN.
Output Example:
Cluster Names
AZ A,B
AY,AY C,E
AZ,AZ F
How can I do it using Python?
PS. Actual dataset may vary in number of rows and columns Also, how can I do it using ML based classification algorithms like KNN, Naive Bayes, etc?
Use groupby + agg twice; once to join "Systems" and then to join "Names":
s = df.dropna().groupby('Name').agg(', '.join)['System']
s = pd.Series(s.index, index=s)
out = s.groupby(level=0).agg(', '.join).reset_index().rename(columns={'System':'Cluster'})
Output:
Cluster Name
0 AY, AY C, E
1 AZ A, B, D
2 AZ, AZ F
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With