 

kmodes vs. one-hot encoding + kmeans for categorical data?

I'm exploring the possibility of clustering some categorical data with Python. I currently have 8 features, each with approximately 3-10 levels.

As I understand it, both one-hot encoding + kmeans and kmodes can be used in this setting, with kmeans possibly becoming less suitable when there are many feature/level combinations, due to curse-of-dimensionality problems.

Is this correct?

At the moment I would follow the kmeans route, because it gives me the flexibility to throw in some numerical features as well, and computing the silhouette statistic to assess the optimal number of clusters seems much easier.

Does this make sense? Do you have any suggestion on situations in which one approach should be preferred over the other?

Thanks

asked May 16 '19 by crash

People also ask

What are the k-means and k-modes algorithms?

k-modes is used for clustering categorical variables. It defines clusters based on the number of matching categories between data points. (This is in contrast to the more well-known k-means algorithm, which clusters numerical data based on Euclidean distance.)
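To make that concrete, here is a minimal sketch (my own illustration, not taken from any particular library) of the simple matching dissimilarity that k-modes relies on:

```python
# Simple matching dissimilarity: count the attributes on which two
# categorical records disagree (0 means the records are identical).
def matching_dissimilarity(a, b):
    return sum(x != y for x, y in zip(a, b))

record_1 = ("red", "small", "cotton")
record_2 = ("red", "large", "wool")

print(matching_dissimilarity(record_1, record_2))  # 2 -- they differ on 2 of 3 attributes
```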

Why is k-means not suitable for categorical data?

k-means is not suitable for data that contains categorical variables. Huang therefore proposed an algorithm called k-modes, which was created to handle clustering of categorical data. k-modes can be seen as the modification of k-means for categorical variables.

What is kmodes clustering?

KModes clustering is an unsupervised machine learning algorithm used to cluster categorical variables. You might be wondering: why KModes when we already have KMeans?

What are the different types of encoding in categorical data?

Encoding categorical data:

1. Ordinal encoding. Each unique category value is assigned an integer value. ...
2. One-hot encoding. For categorical variables where no ordinal relationship exists, integer encoding may be insufficient at best, or misleading to the model at worst.
3. Dummy variable encoding. ...
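As a quick, hypothetical illustration of the last two encodings using pandas (the 'colour' column and its levels are made up):

```python
import pandas as pd

# Hypothetical categorical feature with three levels
df = pd.DataFrame({"colour": ["red", "green", "blue", "green"]})

# One-hot encoding: one 0/1 indicator column per level
one_hot = pd.get_dummies(df["colour"], prefix="colour")

# Dummy-variable encoding: drop one level to avoid the redundant column
dummy = pd.get_dummies(df["colour"], prefix="colour", drop_first=True)

print(one_hot)
print(dummy)
```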


1 Answer

Refer to this paper by Huang (the author of kmodes): http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.134.83&rep=rep1&type=pdf

  1. He mentions that using kmeans + one-hot encoding will greatly increase the size of the dataset if the categorical attributes have a large number of categories, which makes kmeans computationally costly. So yes, your intuition about the curse of dimensionality is right (see the short illustration after this list).

  2. Also, the cluster means will not be meaningful, since the 0s and 1s produced by one-hot encoding are not real values of the data. kmodes, on the other hand, produces cluster modes, which are actual data points and therefore make the clusters interpretable.
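To give a rough sense of point 1, here is a minimal sketch with made-up data (the feat_0 ... feat_7 column names are hypothetical): one-hot encoding 8 categorical features with 3-10 levels each already produces dozens of binary columns.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Made-up dataset: 8 categorical features with 3-10 levels each
n_levels = [3, 4, 5, 6, 7, 8, 9, 10]
df = pd.DataFrame({
    f"feat_{i}": rng.integers(0, k, size=1000).astype(str)
    for i, k in enumerate(n_levels)
})

one_hot = pd.get_dummies(df)
print(df.shape)       # (1000, 8)
print(one_hot.shape)  # (1000, 52) -- one binary column per level
```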

For your requirement of both numerical and categorical attributes, look at the k-prototypes method, which combines kmeans and kmodes using a balancing weight factor (again, explained in the paper).

Code sample in Python:
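Below is a minimal sketch using the open-source kmodes package (pip install kmodes); the toy records and parameter choices are only for illustration:

```python
import numpy as np
from kmodes.kmodes import KModes
from kmodes.kprototypes import KPrototypes

# Toy, made-up data: three categorical attributes
X_cat = np.array([
    ["red",   "small", "cotton"],
    ["red",   "large", "wool"],
    ["blue",  "small", "cotton"],
    ["green", "large", "wool"],
])

# k-modes on purely categorical data
km = KModes(n_clusters=2, init="Huang", n_init=5, random_state=42)
labels = km.fit_predict(X_cat)
print("k-modes labels:", labels)
print("cluster modes :", km.cluster_centroids_)  # modes are real category values

# k-prototypes for mixed data: column 0 is numerical, columns 1-3 categorical.
# The gamma parameter is the balancing weight factor; by default it is
# estimated from the data.
X_mixed = np.array([
    [1.2, "red",   "small", "cotton"],
    [3.4, "red",   "large", "wool"],
    [0.9, "blue",  "small", "cotton"],
    [4.1, "green", "large", "wool"],
], dtype=object)

kp = KPrototypes(n_clusters=2, init="Cao", n_init=5, random_state=42)
mixed_labels = kp.fit_predict(X_mixed, categorical=[1, 2, 3])
print("k-prototypes labels:", mixed_labels)
```

Note that the cluster_centroids_ returned by KModes are actual category values, which is what makes the clusters interpretable, as mentioned in point 2 above.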

answered Oct 11 '22 by conflicted_user