
Quickly remove zero variance variables from a data.frame

I have a large data.frame that was generated by a process outside my control, which may or may not contain variables with zero variance (i.e. all the observations are the same). I would like to build a predictive model based on this data, and obviously these variables are of no use.

Here's the function I'm using to remove such variables from the data.frame. It's based on apply, and I was wondering whether there are any obvious ways to speed it up so that it works quickly on very large datasets with a large number (400 or 500) of variables.

set.seed(1)
dat <- data.frame(
    A=factor(rep("X",10),levels=c('X','Y')),
    B=round(runif(10)*10),
    C=rep(10,10),
    D=c(rep(10,9),1),
    E=factor(rep("A",10)),
    F=factor(rep(c("I","J"),5)),
    G=c(rep(10,9),NA)
)

zeroVar <- function(data, useNA = 'ifany') {
    out <- apply(data, 2, function(x) {length(table(x, useNA = useNA))})
    which(out==1)
}

And here's the result of the process:

> dat
   A B  C  D E F  G
1  X 3 10 10 A I 10
2  X 4 10 10 A J 10
3  X 6 10 10 A I 10
4  X 9 10 10 A J 10
5  X 2 10 10 A I 10
6  X 9 10 10 A J 10
7  X 9 10 10 A I 10
8  X 7 10 10 A J 10
9  X 6 10 10 A I 10
10 X 1 10  1 A J NA

> dat[,-zeroVar(dat)]
   B  D F  G
1  3 10 I 10
2  4 10 J 10
3  6 10 I 10
4  9 10 J 10
5  2 10 I 10
6  9 10 J 10
7  9 10 I 10
8  7 10 J 10
9  6 10 I 10
10 1  1 J NA

> dat[,-zeroVar(dat, useNA = 'no')]
   B  D F
1  3 10 I
2  4 10 J
3  6 10 I
4  9 10 J
5  2 10 I
6  9 10 J
7  9 10 I
8  7 10 J
9  6 10 I
10 1  1 J
Zach asked Jan 10 '12 14:01



1 Answer

You may also want to look into the nearZeroVar() function in the caret package.

If a variable has only one event out of 1000 observations, it might be a good idea to discard it as well (but this depends on the model). nearZeroVar() can flag such variables.
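A minimal sketch of both ideas, using the example data from the question. The helper name zeroVarFast and its na.rm argument are illustrative assumptions, not part of caret or of the question's code. The base-R part counts distinct values column by column with vapply, which is usually cheaper than apply because it avoids coercing the whole data.frame to a character matrix. The caret part assumes the package is installed, and passes nearZeroVar()'s documented default cutoffs explicitly so they are visible.

```r
# Example data from the question.
set.seed(1)
dat <- data.frame(
    A = factor(rep("X", 10), levels = c("X", "Y")),
    B = round(runif(10) * 10),
    C = rep(10, 10),
    D = c(rep(10, 9), 1),
    E = factor(rep("A", 10)),
    F = factor(rep(c("I", "J"), 5)),
    G = c(rep(10, 9), NA)
)

# Illustrative base-R helper (hypothetical name): returns the indices of
# columns with at most one distinct value. With na.rm = FALSE an NA counts
# as its own value, similar to useNA = 'ifany' in the question.
zeroVarFast <- function(data, na.rm = FALSE) {
    n_distinct <- vapply(data, function(x) {
        if (na.rm) x <- x[!is.na(x)]
        length(unique(x))
    }, integer(1))
    which(n_distinct <= 1)
}

zv <- zeroVarFast(dat)
# Guard the empty case: dat[, -integer(0)] would drop every column.
dat_clean <- if (length(zv) > 0) dat[, -zv, drop = FALSE] else dat

# caret's nearZeroVar() additionally flags variables dominated by one value.
# Skipped silently if caret is not installed.
if (requireNamespace("caret", quietly = TRUE)) {
    nzv <- caret::nearZeroVar(dat, freqCut = 95/5, uniqueCut = 10)
    dat_nzv <- if (length(nzv) > 0) dat[, -nzv, drop = FALSE] else dat
    print(names(dat_nzv))
}
```

On the example data, zeroVarFast(dat) picks out A, C and E, and zeroVarFast(dat, na.rm = TRUE) also picks out G. Calling nearZeroVar() with saveMetrics = TRUE returns the frequency ratio and percentage of unique values per column, which helps when choosing cutoffs for a particular model.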

topepo answered Sep 22 '22 04:09