
How to remove data frame column with a single value

Tags: dataframe, r

Let's say I have the following data frame in R:

df1 <- data.frame(Item_Name = c("test1","test2","test3"), D_1=c(1,0,1),
                  D_2=c(1,1,1), D_3=c(11,3,1))

I would like to create a function that deletes columns with no variance (in this case, it would remove column D_2 because it contains only one distinct value).

I know I could check this by hand, but my real data is very large and I would like to automate it. Any ideas?

Benoit_Plante asked Sep 11 '12 03:09


3 Answers

In dplyr, we can use n_distinct to count unique values and select(where(...)) to keep only the columns with more than one (select_if does the same in dplyr < 1.0.0)

library(dplyr)
df1 %>% select(where(~n_distinct(.) > 1))
#For dplyr < 1.0.0
#df1 %>% select_if(~n_distinct(.) > 1)

#  Item_Name D_1 D_3
#1     test1   1  11
#2     test2   0   3
#3     test3   1   1

We can use the same logic with purrr's keep and discard

purrr::keep(df1, ~n_distinct(.) > 1)
purrr::discard(df1, ~n_distinct(.) == 1)

Alternatively, the data.table way of doing it could be

library(data.table)

setDT(df1)
df1[, lapply(df1, uniqueN) > 1, with = FALSE]

Or, probably a cleaner option, pass the same condition to .SDcols

df1[, .SD, .SDcols=lapply(df1, uniqueN) > 1]

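In all of the approaches above, you can replace n_distinct/uniqueN with var or sd, provided you first subset to the numeric columns.

For example,

# Select numeric columns by type: once setDT(df1) has been run,
# df1[-1] would drop the first row rather than the first column.
df1 %>% select(where(is.numeric)) %>% select(where(~sd(.) != 0))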
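One detail worth checking when counting unique values is how missing data is treated. By default n_distinct counts NA as a distinct value, so a column that is constant apart from some NAs is kept. The sketch below illustrates this with a hypothetical data frame df2 (not part of the question) and the na.rm argument of n_distinct; it assumes dplyr >= 1.0.0 for where().

```r
library(dplyr)

# Hypothetical variant of df1 in which D_2 contains a missing value
df2 <- data.frame(Item_Name = c("test1", "test2", "test3"),
                  D_1 = c(1, 0, 1),
                  D_2 = c(1, NA, 1),
                  D_3 = c(11, 3, 1))

# Default behaviour: NA counts as a distinct value, so D_2 is kept
kept_default <- df2 %>% select(where(~ n_distinct(.) > 1))

# With na.rm = TRUE, NA is ignored, so D_2 is dropped
kept_na_rm <- df2 %>% select(where(~ n_distinct(., na.rm = TRUE) > 1))

names(kept_default)
names(kept_na_rm)
```

Which behaviour is right depends on whether a missing value should count as variation in your data.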
Ronak Shah answered Oct 17 '22 08:10


Filter is a useful function here. Applied to a data frame, it keeps only the columns for which the predicate returns TRUE, in this case those with more than one unique value:

Filter(function(x) length(unique(x)) > 1, df1)

##   Item_Name D_1 D_3
## 1     test1   1  11
## 2     test2   0   3
## 3     test3   1   1
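Since the question asks for a function, the Filter call can be wrapped in one. This is only a minimal sketch in base R; the name drop_constant_cols is made up for illustration and does not come from any package.

```r
# Hypothetical helper (illustrative name): keep only the columns
# that contain more than one unique value.
drop_constant_cols <- function(df) {
  Filter(function(x) length(unique(x)) > 1, df)
}

# The data frame from the question
df1 <- data.frame(Item_Name = c("test1", "test2", "test3"),
                  D_1 = c(1, 0, 1),
                  D_2 = c(1, 1, 1),
                  D_3 = c(11, 3, 1))

result <- drop_constant_cols(df1)
names(result)
```

Because Filter subsets the data frame by column, the result is still a data frame and the remaining columns keep their original order.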
mnel answered Oct 17 '22 06:10


You can do:

df1[c(TRUE, lapply(df1[-1], var, na.rm = TRUE) != 0)]
#   Item_Name D_1 D_3
# 1     test1   1  11
# 2     test2   0   3
# 3     test3   1   1

where the lapply piece tells you which variables have some variance:

lapply(df1[-1], var, na.rm = TRUE) != 0
#   D_1   D_2   D_3 
#   TRUE FALSE  TRUE 
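The expression above relies on the first column being the only non-numeric one. If that is not known in advance, one possible variant is to test each column's type inside the check. The following is a sketch under that assumption, using vapply so the result is guaranteed to be a logical vector; has_variance is just an illustrative name.

```r
# The data frame from the question
df1 <- data.frame(Item_Name = c("test1", "test2", "test3"),
                  D_1 = c(1, 0, 1),
                  D_2 = c(1, 1, 1),
                  D_3 = c(11, 3, 1))

# TRUE for non-numeric columns (always kept) and for numeric
# columns whose variance is defined and non-zero.
has_variance <- vapply(df1, function(x) {
  if (!is.numeric(x)) return(TRUE)
  v <- var(x, na.rm = TRUE)
  !is.na(v) && v != 0
}, logical(1))

kept <- df1[has_variance]
names(kept)
```

Note that in this sketch a numeric column with fewer than two non-missing values has an NA variance and is dropped.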
flodel answered Oct 17 '22 06:10