
Dimension Reduction in Categorical Data with missing values


I have a regression model in which the dependent variable is continuous, but ninety percent of the independent variables are categorical (both ordered and unordered) and around thirty percent of the records have missing values (to make matters worse, they are missing randomly without any pattern; that is, more than forty-five percent of the data have at least one missing value). There is no a priori theory to choose the specification of the model, so one of the key tasks is dimension reduction before running the regression. While I am aware of several methods for dimension reduction for continuous variables, I am not aware of a similar statistical literature for categorical data (except, perhaps, as part of correspondence analysis, which is basically a variation of principal component analysis on a frequency table). Let me also add that the dataset is of moderate size: 500,000 observations with 200 variables. I have two questions.

  1. Is there a good statistical reference for dimension reduction for categorical data, along with robust imputation? (I think the first issue is imputation, and then dimension reduction.)
  2. This is linked to the implementation of the above problem. I have used R extensively and tend to rely heavily on the transcan and impute functions for continuous variables, and I use a variation of a tree method to impute categorical values. I have a working knowledge of Python, so if there is something good available for this purpose I will use it. Any implementation pointers in Python or R would be of great help. Thank you.
user227290 asked May 14 '10 21:05





1 Answer

Regarding imputation of categorical data, I would suggest checking the mice package. Also take a look at this presentation, which explains how it imputes multivariate categorical data. Another package for multiple imputation of incomplete multivariate data is Amelia. Amelia includes some limited capacity to deal with ordinal and nominal variables.
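To make the idea of multiple imputation for a categorical variable concrete, here is a minimal Python sketch. It is not what mice or Amelia do internally (those fit models to the data); it swaps in a much simpler random hot-deck scheme, where each missing value is drawn from observed values of records that match on one other variable, and the draw is repeated to produce several completed datasets. The function name `hot_deck_impute`, its parameters, and the toy records are all made up for illustration.

```python
import random

def hot_deck_impute(rows, target, by, rng):
    """Fill None in column `target` with a random observed value taken from
    records sharing the same value in column `by`. Falls back to all observed
    values of `target` when no matching donor exists. (Hypothetical helper.)"""
    donors = {}   # value of `by` -> observed target values
    pool = []     # every observed target value (fallback donors)
    for row in rows:
        if row[target] is not None:
            pool.append(row[target])
            donors.setdefault(row[by], []).append(row[target])
    completed = []
    for row in rows:
        row = list(row)
        if row[target] is None:
            row[target] = rng.choice(donors.get(row[by], pool))
        completed.append(tuple(row))
    return completed

# Toy data (assumed): (region, tenure), with tenure partly missing.
records = [
    ("urban", "renter"),
    ("urban", "renter"),
    ("urban", None),
    ("rural", "owner"),
    ("rural", "owner"),
    ("rural", None),
    ("suburban", None),  # no donor shares this region, so the pool is used
]

# Several completed datasets, one per seed, as in multiple imputation.
imputations = [
    hot_deck_impute(records, target=1, by=0, rng=random.Random(seed))
    for seed in range(5)
]
```

Each completed dataset would then be analysed separately and the results pooled; with 200 variables, a model-based imputer such as the packages named above is the more realistic choice.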

As for dimensionality reduction for categorical data (i.e. a way to arrange variables into homogeneous clusters), I would suggest Multiple Correspondence Analysis (MCA), which will give you the latent variables that maximize the homogeneity of the clusters. As in Principal Component Analysis (PCA) and Factor Analysis, the MCA solution can also be rotated to increase component simplicity. The idea behind a rotation is to find subsets of variables which coincide more clearly with the rotated components. This implies that maximizing component simplicity can help in factor interpretation and in clustering variables. In R, MCA methods are included in the packages ade4, MASS, FactoMineR and ca (at least). As for FactoMineR, you can use it through a graphical interface if you add it as an extra menu to the ones already provided by the Rcmdr package, by installing RcmdrPlugin.FactoMineR.
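As a rough illustration of what MCA computes, the following pure-Python sketch builds the indicator (one-hot) matrix for a few complete categorical records, forms the standardized residuals used in correspondence analysis, and extracts the first principal inertia by power iteration. It assumes the data have already been imputed (no missing values) and covers only the first dimension, with no rotation; the packages listed above use a full decomposition instead. All function names and the toy data are assumptions made for this example.

```python
def indicator_matrix(rows):
    """One-hot encode complete categorical records; returns (Z, labels)."""
    n_vars = len(rows[0])
    labels = [(j, level) for j in range(n_vars)
              for level in sorted({row[j] for row in rows})]
    col = {label: k for k, label in enumerate(labels)}
    Z = [[0.0] * len(labels) for _ in rows]
    for i, row in enumerate(rows):
        for j, level in enumerate(row):
            Z[i][col[(j, level)]] = 1.0
    return Z, labels

def standardized_residuals(Z):
    """S[i][k] = (p_ik - r_i * c_k) / sqrt(r_i * c_k), with P = Z / sum(Z)."""
    total = sum(map(sum, Z))
    r = [sum(row) / total for row in Z]            # row masses
    c = [sum(col) / total for col in zip(*Z)]      # column masses
    return [[(Z[i][k] / total - r[i] * c[k]) / (r[i] * c[k]) ** 0.5
             for k in range(len(c))] for i in range(len(Z))]

def first_dimension(S, iters=200):
    """Largest eigenvalue of S^T S (first principal inertia) and its
    eigenvector, found by power iteration with a fixed starting vector."""
    m = len(S[0])
    v = [1.0 + 0.1 * k for k in range(m)]
    for _ in range(iters):
        u = [sum(row[k] * v[k] for k in range(m)) for row in S]          # S v
        w = [sum(S[i][k] * u[i] for i in range(len(S))) for k in range(m)]  # S^T u
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    u = [sum(row[k] * v[k] for k in range(m)) for row in S]
    return sum(x * x for x in u), v

# Toy data (assumed): three categorical variables, six complete records.
data = [
    ("low", "red", "yes"),
    ("low", "red", "yes"),
    ("mid", "blue", "no"),
    ("high", "blue", "no"),
    ("high", "green", "no"),
    ("mid", "green", "yes"),
]

Z, labels = indicator_matrix(data)
S = standardized_residuals(Z)
inertia = sum(x * x for row in S for x in row)   # total inertia
lam, loadings = first_dimension(S)
print(f"total inertia = {inertia:.4f}, first principal inertia = {lam:.4f}")
```

For an indicator matrix with Q variables and J categories in total, the total inertia equals J/Q - 1, which gives a quick sanity check on the residuals (here 8/3 - 1). The ratio of the first principal inertia to the total indicates how much of the association structure the first latent dimension captures.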

George Dontas answered Sep 27 '22 20:09
