
Clustering algorithm in R for missing categorical and numerical values

I want to perform marketing segmentation clustering on a dataset with missing categorical and numerical values in R. I cannot perform k-means clustering because of the missing values.

R version 3.1.0 (2014-04-10)

Platform: x86_64-apple-darwin13.1.0 (64-bit)

Mac OS X 10.9.3, 4 GB hard drive

Is there an R package with a clustering algorithm that can accommodate a partial fill rate? In the scholarly articles I have read on missing values, the researchers create a new algorithm for their special use case, and the corresponding packages are not available in R. Examples are k-means with soft constraints and k-means clustering with a partial distance strategy.

I have 36 variables, but here is a description of the first 5:

head(df)

  user_id   Age Gender Household.Income Marital.Status
1   12945         Male
2   12947         Male
3   12990
4   13160 25-34   Male        100k-125k         Single
5   13195         Male         75k-100k         Single
6   13286

Please let me know if I can provide additional information.

Scott Davis asked Jun 03 '14 23:06



1 Answer

The k-means algorithm is usually not preferred in the presence of categorical variables. A variant of k-means, called k-prototypes, can handle mixed data types. You can find more about the package that can do this here.
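As a rough illustration of k-prototypes on mixed data, here is a minimal sketch. It assumes the clustMixType package and its kproto() function, which is one R implementation of k-prototypes and not necessarily the package linked above. The toy data, the column names, and the choice of k = 2 are made up for the example.

```r
# Assumption: clustMixType is installed (install.packages("clustMixType")).
library(clustMixType)

set.seed(42)

# Hypothetical mixed-type data: one numeric column and one factor column.
toy <- data.frame(
  income = c(rnorm(20, mean = 40, sd = 5), rnorm(20, mean = 110, sd = 5)),
  gender = factor(sample(c("Male", "Female"), 40, replace = TRUE))
)

# k-prototypes combines squared Euclidean distance on the numeric columns
# with simple matching on the factor columns; lambda (the weight between
# the two parts) is estimated from the data when it is not supplied.
fit <- kproto(toy, k = 2, verbose = FALSE)

# Cluster assignment per row and the cluster prototypes.
table(fit$cluster)
fit$centers
```

Categorical columns need to be factors for this to work, so character columns would have to be converted with factor() first.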

For missing values, you can either remove the affected rows (which is usually not preferred) or impute suitable values. Generally, the mean can be imputed for a numeric variable and the mode for a categorical variable. Alternatively, a standard imputation package such as mice can be used.
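The simple mean/mode imputation described above could look roughly like this in base R. This is only a sketch on made-up data: df_toy, mode_level and impute_simple are hypothetical names, and it assumes blank cells have already been read in as NA (for example with na.strings = "" in read.csv).

```r
# Hypothetical toy data; blanks are assumed to be NA already.
df_toy <- data.frame(
  visits = c(3, NA, 7, 5, NA, 9),
  Gender = factor(c("Male", "Male", NA, "Male", "Female", NA))
)

# Most frequent non-missing level; ties go to the first level in table order.
mode_level <- function(x) names(which.max(table(x)))

# Replace NA with the column mean (numeric) or the column mode (otherwise).
impute_simple <- function(d) {
  for (col in names(d)) {
    miss <- is.na(d[[col]])
    if (!any(miss)) next
    if (is.numeric(d[[col]])) {
      d[[col]][miss] <- mean(d[[col]], na.rm = TRUE)
    } else {
      d[[col]][miss] <- mode_level(d[[col]])
    }
  }
  d
}

df_imputed <- impute_simple(df_toy)
df_imputed
```

With a fill rate as low as in the question, this kind of imputation pushes many rows toward the same values, so it is worth checking how much of each column is actually observed before relying on it.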

Ref:

Z. Huang (1998): Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values, Data Mining and Knowledge Discovery 2, 283-304.

prashanth answered Sep 30 '22 19:09