Clustering algorithm in R for missing categorical and numerical values

I want to perform marketing segmentation clustering on a dataset with missing categorical and numerical values in R. I cannot perform k-means clustering because of the missing values.

R version 3.1.0 (2014-04-10)

Platform: x86_64-apple-darwin13.1.0 (64-bit)

Mac OS X 10.9.3, 4GB hard drive

Is there a clustering package available in R that can accommodate a partial fill rate? In the scholarly articles on missing values that I have looked at, the researchers create a new algorithm for their special use case, and no corresponding packages are available in R. Examples are k-means with soft constraints and k-means clustering with a partial distance strategy.

I have 36 variables, but here is a description of the first 5:

head(df)

  user_id   Age Gender Household.Income Marital.Status
1   12945         Male
2   12947         Male
3   12990
4   13160 25-34   Male        100k-125k         Single
5   13195         Male         75k-100k         Single
6   13286

Please let me know if I can provide additional information.

asked Jun 03 '14 23:06

Scott Davis

1 Answer

The k-means algorithm is usually not preferred when categorical variables are present, because it relies on means and Euclidean distances, which are not meaningful for categories. There is a variant of k-means, called k-prototypes, which can handle mixed data types. You can find more about the package that can do this here.
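As a rough illustration of how k-prototypes treats mixed data, the dissimilarity described by Huang (1998) adds the squared Euclidean distance on the numeric variables to a weighted count of mismatches on the categorical ones. The sketch below uses base R only. The function name `mixed_dissimilarity`, the weight `gamma`, and the toy values are assumptions made for illustration and are not taken from any package.

```r
# Sketch of the k-prototypes dissimilarity between one observation and one
# cluster prototype (hypothetical helper, not a package function).
# x_num, proto_num: numeric vectors of equal length
# x_cat, proto_cat: character vectors of equal length
# gamma: weight of the categorical part relative to the numeric part
mixed_dissimilarity <- function(x_num, x_cat, proto_num, proto_cat, gamma = 1) {
  numeric_part <- sum((x_num - proto_num)^2)   # squared Euclidean distance
  categorical_part <- sum(x_cat != proto_cat)  # simple matching: count mismatches
  numeric_part + gamma * categorical_part
}

# Toy example with made-up values
d <- mixed_dissimilarity(
  x_num = c(1, 2), x_cat = c("Male", "Single"),
  proto_num = c(0, 0), proto_cat = c("Male", "Married"),
  gamma = 0.5
)
print(d)
```

This helper returns NA if any input is missing, so missing values still have to be handled before clustering.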

For missing values, you can either remove the affected rows (which is usually not preferred, because it discards data) or impute suitable values. A simple approach is to impute the mean for a numeric variable and the mode for a categorical variable. Alternatively, standard imputation packages such as mice can be used.
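A minimal base R sketch of the simple mean/mode imputation described above might look like this. The helper names `stat_mode` and `impute_simple` and the toy data frame are hypothetical, and the sketch assumes missing entries are already coded as NA rather than as empty strings.

```r
# Return the most frequent non-missing value of a vector (hypothetical helper).
stat_mode <- function(x) {
  counts <- table(x, useNA = "no")
  names(counts)[which.max(counts)]
}

# Impute numeric columns with the mean and all other columns with the mode.
# Columns that are complete or entirely missing are left unchanged.
impute_simple <- function(df) {
  for (col in names(df)) {
    missing <- is.na(df[[col]])
    if (!any(missing) || all(missing)) next
    if (is.numeric(df[[col]])) {
      df[[col]][missing] <- mean(df[[col]], na.rm = TRUE)
    } else {
      df[[col]][missing] <- stat_mode(df[[col]])
    }
  }
  df
}

# Toy data (made up). Blank cells like those in head(df) would first need to
# be read in as NA, e.g. with read.csv(..., na.strings = "").
toy <- data.frame(
  income = c(50, NA, 70),
  gender = c("Male", NA, "Male"),
  stringsAsFactors = FALSE
)
imputed <- impute_simple(toy)
print(imputed)
```

With many missing cells, as in the sample rows of the question, this kind of single-value imputation pulls many records toward the same point, which is one reason to consider a model-based package such as mice instead.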

Ref:

Z. Huang (1998): Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values, Data Mining and Knowledge Discovery 2, 283-304.

answered Sep 30 '22 19:09

prashanth

Tags:

r

missing-data

machine-learning

cluster-analysis
