
Kmeans using categorical variables

I have a large dataset of 45421 rows by 12 columns in which every variable is categorical; there are no numerical variables. I would like to build an unsupervised clustering model on it, but before modeling I would like to know the best feature selection method for this kind of data. I am also unable to plot an elbow curve: I am running the k-means elbow method over k = 1-1000, but it does not produce a plot showing an optimal number of clusters, and it takes 8-10 hours to execute. Any suggestions for a better approach would be a great help.

Code:

import pandas as pd

# Small sample of the data; every column must have the same number of values.
# (The original 'UserClass' list had a fifth entry, which makes pd.DataFrame raise a ValueError.)
data = {'UserName':['infuk_tof', 'infus_llk', 'infaus_kkn', 'infin_mdx'], 
       'UserClass':['high','low','low','medium'], 
       'UserCountry':['unitedkingdom','unitedstates','australia','india'], 
       'UserRegion':['EMEA','EMEA','APAC','APAC'], 
       'UserOrganization':['INFBLRPR','INFBLRHC','INFBLRPR','INFBLRHC'], 
       'UserAccesstype':['Region','country','country','region']} 

df = pd.DataFrame(data) 
Praveen asked Dec 12 '19 18:12


People also ask

Can I use categorical variables in clustering?

Yes. Clustering groups objects based on the similarity and dissimilarity between them, and k-modes is an unsupervised machine learning algorithm designed to cluster categorical variables.

Why is it difficult to handle categorical data for clustering?

Clustering categorical data is somewhat harder than clustering numeric data because categories have no natural order, the data is often high-dimensional, and clusters may exist only in subspaces of the attributes. One approach is to convert the data into an equivalent numeric form, but such conversions have their own limitations.

Can k-means be used for non numeric data?

Not directly. K-means relies on means and Euclidean distances, so it cannot handle non-numerical data such as text unless the data is first encoded numerically.

Can K Medoids handle categorical data?

K-Medoids can handle mixed data (numeric and categorical), provided it is given a dissimilarity measure that supports both types, such as Gower distance.


1 Answer

For categorical data like this, K-means is not an appropriate clustering algorithm. You may want to use a K-modes method instead, which unfortunately is not currently included in the scikit-learn package. There is a kmodes package available on GitHub, https://github.com/nicodv/kmodes, which follows much of the syntax you're used to from scikit-learn.
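To make the idea behind K-modes concrete, here is a minimal pure-Python sketch, not the implementation used by the kmodes package. It replaces the Euclidean distance with a simple count of mismatching attributes and replaces the mean with the per-column mode. The function names (`matching_dissimilarity`, `column_modes`, `k_modes`), the random initialization, and the sample rows (adapted from the question, with `UserName` dropped and `UserAccesstype` lowercased) are all assumptions made for illustration.

```python
import random
from collections import Counter

def matching_dissimilarity(a, b):
    """Count the attributes on which two records disagree."""
    return sum(x != y for x, y in zip(a, b))

def column_modes(rows):
    """Most frequent value per column; this plays the role of a centroid."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))

def k_modes(rows, k, n_iter=20, seed=0):
    """Illustrative k-modes loop: assign to nearest mode, then recompute modes."""
    rng = random.Random(seed)
    modes = rng.sample(rows, k)  # assumption: start from k randomly chosen records
    labels = None
    for _ in range(n_iter):
        new_labels = [
            min(range(k), key=lambda j: matching_dissimilarity(r, modes[j]))
            for r in rows
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for j in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == j]
            if members:  # keep the old mode if a cluster ends up empty
                modes[j] = column_modes(members)
    # Total mismatch count between each record and the mode of its cluster
    cost = sum(matching_dissimilarity(r, modes[lab]) for r, lab in zip(rows, labels))
    return labels, modes, cost

# Hypothetical sample rows: (UserClass, UserCountry, UserRegion, UserOrganization, UserAccesstype)
rows = [
    ('high', 'unitedkingdom', 'EMEA', 'INFBLRPR', 'region'),
    ('low', 'unitedstates', 'EMEA', 'INFBLRHC', 'country'),
    ('low', 'australia', 'APAC', 'INFBLRPR', 'country'),
    ('medium', 'india', 'APAC', 'INFBLRHC', 'region'),
]

labels, modes, cost = k_modes(rows, k=2)
print(labels, cost)
```

The returned cost is analogous to the inertia in k-means, so an elbow curve could in principle be drawn by running this over a small range of k (for example 1 to 20) rather than 1 to 1000, which should also cut the run time considerably.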

For more, please see the discussion here: https://datascience.stackexchange.com/questions/22/k-means-clustering-for-mixed-numeric-and-categorical-data

sjc answered Oct 20 '22 00:10