I want to perform marketing segmentation clustering on a dataset with missing categorical and numerical values in R. I cannot perform k-means clustering because of the missing values.
R version 3.1.0 (2014-04-10)
Platform: x86_64-apple-darwin13.1.0 (64-bit)
Mac OS X 10.9.3, 4GB hard drive
Is there a clustering algorithm package in R available that can accommodate a partial fill rate? Looking at scholarly articles on missing values, the researchers create a new algorithm for the special use case and the packages are not available in R. For example, k-means with soft constraints and k-means clustering with partial distance strategy.
I have 36 variables, but here is a description of the first 5:
head(df)
  user_id   Age Gender Household.Income Marital.Status
1   12945         Male
2   12947         Male
3   12990
4   13160 25-34   Male        100k-125k         Single
5   13195         Male         75k-100k         Single
6   13286
Please let me know if I can provide additional information.
When the missing values are in a categorical column (whether stored as strings or as numeric codes), they can be replaced with the most frequent category. If the share of missing values is very large, they can instead be assigned to a new category of their own.
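To illustrate the two options above, here is a minimal base R sketch. The toy data, the helper names impute_mode and add_missing_level, the "Unknown" label, and the 50% cutoff are all assumptions made for the example, not something taken from the question.

```r
# Toy data standing in for the question's df; the values are made up.
df <- data.frame(
  Gender         = c("Male", "Male", NA, "Male", "Female", NA),
  Marital.Status = c(NA, NA, NA, "Single", "Single", NA),
  stringsAsFactors = FALSE
)

# Replace missing entries with the most frequent observed category.
impute_mode <- function(x) {
  counts <- table(x)  # table() ignores NA by default
  x[is.na(x)] <- names(counts)[which.max(counts)]
  x
}

# Treat "missing" as a category of its own (label is an arbitrary choice).
add_missing_level <- function(x, label = "Unknown") {
  x[is.na(x)] <- label
  x
}

max_missing <- 0.5  # assumed cutoff for "very large"; tune for your data

for (col in names(df)) {
  share_missing <- mean(is.na(df[[col]]))
  df[[col]] <- if (share_missing > max_missing) {
    add_missing_level(df[[col]])
  } else {
    impute_mode(df[[col]])
  }
}

print(df)
```

Here Gender is one third missing and gets the mode, while Marital.Status is two thirds missing and gets the extra category.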
In general, clustering methods cannot analyze items that have missing data values. Common solutions either fill in the missing values (imputation) or ignore the missing data (marginalization).
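One way to try the "ignore the missing data" route in R, as a sketch only: daisy() from the cluster package computes Gower dissimilarities using just the variables observed in both rows of a pair, and pam() can cluster the resulting dissimilarity object. The toy data and k = 2 are assumptions for illustration. Note that a pair of rows with no variable observed in both would get an NA dissimilarity, which pam() cannot handle.

```r
library(cluster)  # recommended package shipped with standard R installs

# Toy mixed-type data with gaps; values are made up for illustration.
df <- data.frame(
  Age    = factor(c("25-34", NA, "35-44", "25-34", NA, "35-44")),
  Gender = factor(c("Male", "Male", "Female", "Male", "Male", "Female")),
  Income = c(110, NA, 60, 105, 90, NA)
)

# Gower dissimilarity: each pair is compared only on the variables
# that are observed in both rows, so no imputation is needed here.
d <- daisy(df, metric = "gower")

# Partitioning around medoids on the dissimilarities (k = 2 is arbitrary).
fit <- pam(d, k = 2)
print(fit$clustering)
```

Rows that are almost entirely empty, like user 12990 in the question, carry little information either way and may be better dropped before clustering.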
k-modes is an algorithm based on the k-means paradigm that is used for clustering categorical data. It defines clusters by the number of matching categories between data points.
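To make the "matching categories" idea concrete, here is a base R sketch of a single k-modes style iteration: assign each record to the nearest mode by counting mismatches, then recompute each cluster's modes. The toy data, the choice of rows 1 and 3 as starting modes, and the helper names are assumptions for the example. A real implementation would repeat these two steps until the assignment stops changing.

```r
# Toy categorical data; values are made up for illustration.
x <- data.frame(
  Gender         = c("Male", "Male", "Female", "Female", "Female"),
  Marital.Status = c("Single", "Single", "Married", "Married", "Single"),
  Age            = c("25-34", "35-44", "35-44", "25-34", "35-44"),
  stringsAsFactors = FALSE
)

# Simple matching dissimilarity: number of variables on which two records differ.
matching_dissimilarity <- function(a, b) sum(a != b)

# Hypothetical starting modes: rows 1 and 3 of the data.
modes <- x[c(1, 3), ]

# Assignment step: each record goes to the mode it mismatches least.
assignment <- unname(apply(x, 1, function(record) {
  which.min(apply(modes, 1, function(m) matching_dissimilarity(record, m)))
}))

# Update step: the new mode of a cluster is the most frequent
# category of each variable among its members.
most_frequent <- function(v) names(which.max(table(v)))
new_modes <- do.call(rbind, lapply(split(x, assignment), function(g) {
  as.data.frame(lapply(g, most_frequent), stringsAsFactors = FALSE)
}))

print(assignment)
print(new_modes)
```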
Generally, replacing the missing values with the mean/median/mode is a crude way of treating missing values. Depending on the context, for example if the variation is low or if the variable has low leverage over the response, such a rough approximation is acceptable and could give satisfactory results.
The k-means algorithm is usually not preferred in the presence of categorical variables. There is a variant of k-means, called k-prototypes, which can handle mixed data types. You can find more about the package that can do this here.
For missing values, you may either remove those rows (which is usually not preferred) or impute suitable values. Generally, for a numeric variable, the mean can be imputed, and for a categorical variable, the mode. Alternatively, standard imputation packages such as mice can be used.
Ref:
Z. Huang (1998): Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Variables, Data Mining and Knowledge Discovery 2, 283-304.
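A rough sketch of the impute-then-cluster workflow described in this answer. It assumes the clustMixType package (which provides a kproto() function for k-prototypes) is installed; whether that is the package the answer originally linked to is not known. The simulated data, the impute_simple helper, and k = 2 are made up for illustration.

```r
set.seed(1)  # k-prototypes starts from randomly chosen prototypes

# Simulated mixed-type data with a few gaps; values are made up.
n <- 40
df <- data.frame(
  Income = c(rnorm(n / 2, mean = 60, sd = 5), rnorm(n / 2, mean = 110, sd = 5)),
  Gender = factor(rep(c("Female", "Male"), each = n / 2))
)
df$Income[c(3, 25)] <- NA
df$Gender[c(7, 30)] <- NA

# Crude imputation: mean for numeric columns, mode for categorical ones.
impute_simple <- function(x) {
  if (is.numeric(x)) {
    x[is.na(x)] <- mean(x, na.rm = TRUE)
  } else {
    counts <- table(x)
    x[is.na(x)] <- names(counts)[which.max(counts)]
  }
  x
}
df[] <- lapply(df, impute_simple)

# Cluster the completed data with k-prototypes, if the package is available.
if (requireNamespace("clustMixType", quietly = TRUE)) {
  fit <- clustMixType::kproto(df, k = 2, verbose = FALSE)
  print(table(fit$cluster))
}
```

If mean/mode imputation is too crude for your data, the imputation step could be swapped for mice::mice() followed by mice::complete(), leaving the clustering step unchanged.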