Quickly remove zero variance variables from a data.frame

Tags: r, data-management

I have a large data.frame that was generated by a process outside my control, which may or may not contain variables with zero variance (i.e. all the observations are the same). I would like to build a predictive model based on this data, and obviously these variables are of no use.

Here's the function I'm currently using to remove such variables from the data.frame. It's based on apply, and I was wondering whether there are any obvious ways to speed it up so that it works quickly on very large datasets with a large number (400 or 500) of variables.

set.seed(1)
dat <- data.frame(
    A=factor(rep("X",10),levels=c('X','Y')),
    B=round(runif(10)*10),
    C=rep(10,10),
    D=c(rep(10,9),1),
    E=factor(rep("A",10)),
    F=factor(rep(c("I","J"),5)),
    G=c(rep(10,9),NA)
)
zeroVar <- function(data, useNA = 'ifany') {
    out <- apply(data, 2, function(x) {length(table(x, useNA = useNA))})
    which(out==1)
}

And here's the result of the process:

> dat
   A B  C  D E F  G
1  X 3 10 10 A I 10
2  X 4 10 10 A J 10
3  X 6 10 10 A I 10
4  X 9 10 10 A J 10
5  X 2 10 10 A I 10
6  X 9 10 10 A J 10
7  X 9 10 10 A I 10
8  X 7 10 10 A J 10
9  X 6 10 10 A I 10
10 X 1 10  1 A J NA

> dat[,-zeroVar(dat)]
   B  D F  G
1  3 10 I 10
2  4 10 J 10
3  6 10 I 10
4  9 10 J 10
5  2 10 I 10
6  9 10 J 10
7  9 10 I 10
8  7 10 J 10
9  6 10 I 10
10 1  1 J NA

> dat[,-zeroVar(dat, useNA = 'no')]
   B  D F
1  3 10 I
2  4 10 J
3  6 10 I
4  9 10 J
5  2 10 I
6  9 10 J
7  9 10 I
8  7 10 J
9  6 10 I
10 1  1 J

902

asked Jan 10 '12 14:01

Zach

1 Answer

You may also want to look into the nearZeroVar() function in the caret package.

If a variable has only one event out of 1000 observations, it might be a good idea to discard it (though this depends on the model). nearZeroVar() can do that.
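As a rough illustration, here is a minimal sketch of both ideas on the example data from the question. The helper zeroVarCols() is a hypothetical name introduced here, not part of the question or of caret; it assumes that counting distinct values per column with vapply() is an acceptable stand-in for the table()-based check, and it treats a column as constant when it has at most one distinct value. The caret part only runs if the package is installed and uses nearZeroVar() with its default cutoffs.

```r
# Example data from the question.
set.seed(1)
dat <- data.frame(
    A = factor(rep("X", 10), levels = c("X", "Y")),
    B = round(runif(10) * 10),
    C = rep(10, 10),
    D = c(rep(10, 9), 1),
    E = factor(rep("A", 10)),
    F = factor(rep(c("I", "J"), 5)),
    G = c(rep(10, 9), NA)
)

# Hypothetical helper (an assumption, not from the question or caret):
# vapply() walks the columns of the data.frame directly, so the data is
# not first coerced to a character matrix as it is with apply(data, 2, ...).
# useNA = TRUE counts NA as its own value; useNA = FALSE ignores NAs.
zeroVarCols <- function(data, useNA = TRUE) {
    n_distinct <- vapply(data, function(x) {
        if (!useNA) x <- x[!is.na(x)]
        length(unique(x))
    }, integer(1))
    # TRUE for columns with at most one distinct value
    n_distinct <= 1
}

# Logical indexing keeps all columns when nothing is flagged, whereas
# negative indexing with an empty which() result would drop every column.
kept <- dat[, !zeroVarCols(dat), drop = FALSE]
kept_no_na <- dat[, !zeroVarCols(dat, useNA = FALSE), drop = FALSE]

# Optional: caret's nearZeroVar(), run only if caret is available.
# saveMetrics = TRUE returns one row per column, including the logical
# columns zeroVar and nzv.
if (requireNamespace("caret", quietly = TRUE)) {
    metrics <- caret::nearZeroVar(dat, saveMetrics = TRUE)
    print(metrics)
    kept_nzv <- dat[, !metrics$nzv, drop = FALSE]
}
```

Whether this is actually faster than the apply() version on a given dataset is worth checking with system.time() on realistic data before relying on it.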

163

answered Sep 22 '22 04:09

topepo
