I have a large dataset of 45421 rows and 12 columns, all of which are categorical variables; there are no numerical variables. I would like to build an unsupervised clustering model on this dataset, but before modeling I would like to know the best feature selection method for it. I am also unable to plot an elbow curve for this dataset: I am using the range k = 1 to 1000 in the k-means elbow method, but it does not produce a usable plot of the optimal number of clusters and takes 8-10 hours to run. If anyone can suggest a better solution to this issue, it would be a great help.
Code:
import pandas as pd

# Every column needs the same number of entries (four here);
# otherwise pd.DataFrame raises a ValueError.
data = {'UserName': ['infuk_tof', 'infus_llk', 'infaus_kkn', 'infin_mdx'],
        'UserClass': ['high', 'low', 'low', 'medium'],
        'UserCountry': ['unitedkingdom', 'unitedstates', 'australia', 'india'],
        'UserRegion': ['EMEA', 'EMEA', 'APAC', 'APAC'],
        'UserOrganization': ['INFBLRPR', 'INFBLRHC', 'INFBLRPR', 'INFBLRHC'],
        'UserAccesstype': ['Region', 'country', 'country', 'region']}
df = pd.DataFrame(data)
A cluster is basically a collection of objects grouped on the basis of the similarity and dissimilarity between them. KModes clustering is an unsupervised machine learning algorithm used to cluster categorical variables.
Clustering categorical data is a bit more difficult than clustering numeric data because of the absence of any natural order, high dimensionality, and the existence of subspace clustering. One approach to make the data easier to handle is to convert it into an equivalent numeric form, but that has its own limitations.
It is text data, and I learned that k-means cannot handle non-numerical data.
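To illustrate the limitation mentioned above, here is a small sketch contrasting integer label encoding with simple matching dissimilarity (counting mismatched attributes), which is the kind of measure k-modes-style methods rely on. The integer codes and the three rows are arbitrary assumptions chosen only for illustration.

```python
# Arbitrary integer codes for the countries (an assumption for illustration only).
codes = {"unitedkingdom": 0, "unitedstates": 1, "australia": 2, "india": 3}

def encoded_distance(a, b):
    """Distance between two categories after integer label encoding."""
    return abs(codes[a] - codes[b])

def matching_dissimilarity(row_a, row_b):
    """Number of attributes on which two categorical rows differ."""
    return sum(x != y for x, y in zip(row_a, row_b))

# Label encoding invents an order: the UK looks "closer" to the US than to India,
# although the categories themselves have no such ordering.
uk_us = encoded_distance("unitedkingdom", "unitedstates")
uk_in = encoded_distance("unitedkingdom", "india")

# Simple matching treats every mismatch the same way.
row1 = ("high", "unitedkingdom", "EMEA")
row2 = ("low", "unitedstates", "EMEA")
row3 = ("medium", "india", "APAC")
d12 = matching_dissimilarity(row1, row2)
d13 = matching_dissimilarity(row1, row3)

print(uk_us, uk_in, d12, d13)
```

With label encoding, the distances depend entirely on which code each category happened to receive, whereas the matching measure only asks whether two values are equal.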
K-Medoids. It can handle mixed data (numeric and categorical); you just need to feed in the data, and it automatically separates the categorical and numeric data.
For categorical data like this, K-means is not the appropriate clustering algorithm. You may want to look for a K-modes method, which is unfortunately not currently included in the scikit-learn package. You may want to look at this kmodes package available on GitHub: https://github.com/nicodv/kmodes, which follows much of the syntax you're used to from scikit-learn.
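To make the idea concrete, here is a minimal pure-Python sketch of a k-modes-style loop. It is not the implementation from the kmodes package, only an illustration under simplifying assumptions: initial modes are random distinct rows, ties are broken by first occurrence, and the helper names (`dissimilarity`, `column_modes`, `assign`, `k_modes`) and the four sample rows are hypothetical. It also shows how an elbow-style cost can be computed over a small range of k rather than 1 to 1000.

```python
import random
from collections import Counter

def dissimilarity(a, b):
    """Simple matching: count the attributes on which two rows differ."""
    return sum(x != y for x, y in zip(a, b))

def column_modes(rows):
    """Most frequent category in each column of a group of rows."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))

def assign(rows, modes):
    """Index of the nearest mode for every row."""
    return [min(range(len(modes)), key=lambda j: dissimilarity(r, modes[j]))
            for r in rows]

def k_modes(rows, k, n_iter=10, seed=0):
    """Toy k-modes; requires k <= number of distinct rows."""
    rng = random.Random(seed)
    modes = rng.sample(list(dict.fromkeys(rows)), k)  # distinct rows as initial modes
    for _ in range(n_iter):
        labels = assign(rows, modes)
        new_modes = []
        for j in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == j]
            # Keep the old mode if a cluster ends up empty.
            new_modes.append(column_modes(members) if members else modes[j])
        if new_modes == modes:  # converged
            break
        modes = new_modes
    labels = assign(rows, modes)
    cost = sum(dissimilarity(r, modes[lab]) for r, lab in zip(rows, labels))
    return labels, modes, cost

# Hypothetical rows loosely modeled on the question's sample data.
rows = [
    ("high", "unitedkingdom", "EMEA", "INFBLRPR", "region"),
    ("low", "unitedstates", "EMEA", "INFBLRHC", "country"),
    ("low", "australia", "APAC", "INFBLRPR", "country"),
    ("medium", "india", "APAC", "INFBLRHC", "region"),
]

# Elbow-style scan: total mismatch cost for a small range of k.
costs = {k: k_modes(rows, k)[2] for k in range(1, 5)}
for k, cost in costs.items():
    print(k, cost)
```

On the real dataset you would likely use the package's own estimator rather than this toy loop, and scan a modest range of k, since each additional k means another full clustering run.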
For more, please see the discussion here: https://datascience.stackexchange.com/questions/22/k-means-clustering-for-mixed-numeric-and-categorical-data