 

kmodes vs. one-hot encoding + kmeans for categorical data?

I'm exploring the possibility of clustering some categorical data with Python. I currently have 8 features, each with approximately 3-10 levels.

As I understand it, both one-hot encoding + kmeans and kmodes can be used in this setting, with kmeans possibly becoming less suitable when there are many feature/level combinations, due to curse-of-dimensionality problems.

Is this correct?

At the moment I would follow the kmeans route, because it gives me the flexibility to throw in some numerical features as well, and computing the silhouette statistic to assess the optimal number of clusters seems much easier.

Does this make sense? Do you have any suggestion on situations in which one approach should be preferred over the other?

Thanks

asked May 16 '19 by crash

People also ask

What are the k-means and k-modes algorithms?

k-modes is used for clustering categorical variables. It defines clusters based on the number of matching categories between data points. (This is in contrast to the more well-known k-means algorithm, which clusters numerical data based on Euclidean distance.)
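To make that concrete, here is a minimal sketch (my own illustration, not taken from any particular library) of the simple matching dissimilarity that k-modes relies on:

```python
# Simple matching dissimilarity: count the attributes on which two
# categorical records disagree (0 means the records are identical).
def matching_dissimilarity(a, b):
    return sum(x != y for x, y in zip(a, b))

record_1 = ("red", "small", "cotton")
record_2 = ("red", "large", "wool")

print(matching_dissimilarity(record_1, record_2))  # 2 -- they differ on 2 of 3 attributes
```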

Why is k-means not suitable for categorical data?

k-means is not suitable for data that contains categorical variables. Huang therefore proposed an algorithm called k-modes, which was created to handle clustering of categorical data. k-modes can be seen as the modification of k-means for categorical variables.

What is kmodes clustering?

KModes clustering is an unsupervised machine learning algorithm used to cluster categorical variables. You might be wondering: why KModes when we already have KMeans?

What are the different types of encoding in categorical data?

Encoding categorical data:

1. Ordinal encoding. Each unique category value is assigned an integer value. ...
2. One-hot encoding. For categorical variables where no ordinal relationship exists, integer encoding may be insufficient at best, or misleading to the model at worst.
3. Dummy variable encoding. ...
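As a quick, hypothetical illustration of the last two encodings using pandas (the 'colour' column and its levels are made up):

```python
import pandas as pd

# Hypothetical categorical feature with three levels
df = pd.DataFrame({"colour": ["red", "green", "blue", "green"]})

# One-hot encoding: one 0/1 indicator column per level
one_hot = pd.get_dummies(df["colour"], prefix="colour")

# Dummy-variable encoding: drop one level to avoid the redundant column
dummy = pd.get_dummies(df["colour"], prefix="colour", drop_first=True)

print(one_hot)
print(dummy)
```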


1 Answer

Refer to this paper by Huang (the author of kmodes): http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.134.83&rep=rep1&type=pdf

  1. He mentions that using kmeans + one-hot encoding will greatly increase the size of the dataset if the categorical attributes have a large number of categories, which makes kmeans computationally costly. So yes, your intuition about the curse of dimensionality is right (see the short illustration after this list).

  2. Also, the cluster means will not be meaningful, since the 0s and 1s produced by one-hot encoding are not real values of the data. kmodes, on the other hand, produces cluster modes, which are actual data points and therefore make the clusters interpretable.
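To give a rough sense of point 1, here is a minimal sketch with made-up data (the feat_0 ... feat_7 column names are hypothetical): one-hot encoding 8 categorical features with 3-10 levels each already produces dozens of binary columns.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Made-up dataset: 8 categorical features with 3-10 levels each
n_levels = [3, 4, 5, 6, 7, 8, 9, 10]
df = pd.DataFrame({
    f"feat_{i}": rng.integers(0, k, size=1000).astype(str)
    for i, k in enumerate(n_levels)
})

one_hot = pd.get_dummies(df)
print(df.shape)       # (1000, 8)
print(one_hot.shape)  # (1000, 52) -- one binary column per level
```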

For your requirement of both numerical and categorical attributes, look at the k-prototypes method, which combines kmeans and kmodes using a balancing weight factor (again, explained in the paper).

Code sample in Python:
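Below is a minimal sketch using the open-source kmodes package (pip install kmodes); the toy records and parameter choices are only for illustration:

```python
import numpy as np
from kmodes.kmodes import KModes
from kmodes.kprototypes import KPrototypes

# Toy, made-up data: three categorical attributes
X_cat = np.array([
    ["red",   "small", "cotton"],
    ["red",   "large", "wool"],
    ["blue",  "small", "cotton"],
    ["green", "large", "wool"],
])

# k-modes on purely categorical data
km = KModes(n_clusters=2, init="Huang", n_init=5, random_state=42)
labels = km.fit_predict(X_cat)
print("k-modes labels:", labels)
print("cluster modes :", km.cluster_centroids_)  # modes are real category values

# k-prototypes for mixed data: column 0 is numerical, columns 1-3 categorical.
# The gamma parameter is the balancing weight factor; by default it is
# estimated from the data.
X_mixed = np.array([
    [1.2, "red",   "small", "cotton"],
    [3.4, "red",   "large", "wool"],
    [0.9, "blue",  "small", "cotton"],
    [4.1, "green", "large", "wool"],
], dtype=object)

kp = KPrototypes(n_clusters=2, init="Cao", n_init=5, random_state=42)
mixed_labels = kp.fit_predict(X_mixed, categorical=[1, 2, 3])
print("k-prototypes labels:", mixed_labels)
```

Note that the cluster_centroids_ returned by KModes are actual category values, which is what makes the clusters interpretable, as mentioned in point 2 above.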

answered Oct 11 '22 by conflicted_user