Let's say I have the following data frame in R:
df1 <- data.frame(Item_Name = c("test1","test2","test3"), D_1=c(1,0,1),
D_2=c(1,1,1), D_3=c(11,3,1))
I would like to create a function that would delete columns with no variance (e.g. in this case, it would remove column D_2 because it has only 1 value).
I know that I could check it by hand, but in reality my data is very large and I would like to automate it. Any idea?
In dplyr, we can use n_distinct to count unique values and select with where (or select_if before dplyr 1.0.0) to select columns:
library(dplyr)
df1 %>% select(where(~n_distinct(.) > 1))
#For dplyr < 1.0.0
#df1 %>% select_if(~n_distinct(.) > 1)
# Item_Name D_1 D_3
#1 test1 1 11
#2 test2 0 3
#3 test3 1 1
We can use the same logic with purrr's keep and discard:
purrr::keep(df1, ~n_distinct(.) > 1)
purrr::discard(df1, ~n_distinct(.) == 1)
Apart from that, the data.table way of doing it could be:
library(data.table)
setDT(df1)
df1[, lapply(df1, uniqueN) > 1, with = FALSE]
Or probably this is smarter/better
df1[, .SD, .SDcols=lapply(df1, uniqueN) > 1]
In all the above approaches you could replace n_distinct/uniqueN with the var or sd function after subsetting only numeric columns.
For example,
df1[-1] %>% select_if(~sd(.) != 0)
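Note that df1[-1] drops Item_Name from the result, and that once setDT(df1) has been run, df1[-1] removes the first row rather than the first column. A minimal sketch that keeps non-numeric columns and applies the sd test only to numeric ones, assuming df1 is a plain data.frame, dplyr >= 1.0.0 is installed, and the numeric columns contain no NA values (the names df_plain and kept are made up for illustration):

```r
library(dplyr)

# Rebuild the example as a plain data.frame (not a data.table)
df_plain <- data.frame(Item_Name = c("test1", "test2", "test3"),
                       D_1 = c(1, 0, 1),
                       D_2 = c(1, 1, 1),
                       D_3 = c(11, 3, 1))

# Keep a column if it is non-numeric, or if it is numeric with non-zero sd
kept <- df_plain %>%
  select(where(~ !is.numeric(.x) || sd(.x) != 0))

names(kept)
```

Because || short-circuits, sd is never called on the character column.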
Filter is a useful function here. I will keep only those columns where there is more than 1 unique value, i.e.
Filter(function(x)(length(unique(x))>1), df1)
## Item_Name D_1 D_3
## 1 test1 1 11
## 2 test2 0 3
## 3 test3 1 1
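Since the question asks for a function, the Filter call can be wrapped in a small helper. This is only a sketch: the name drop_constant_cols and its na.rm argument are hypothetical, not part of base R or of the answer above.

```r
# Hypothetical helper: drop columns that hold a single unique value.
# With na.rm = TRUE, missing values are ignored when counting unique values.
drop_constant_cols <- function(df, na.rm = FALSE) {
  has_variation <- function(x) {
    if (na.rm) x <- x[!is.na(x)]
    length(unique(x)) > 1
  }
  Filter(has_variation, df)
}

df_example <- data.frame(Item_Name = c("test1", "test2", "test3"),
                         D_1 = c(1, 0, 1),
                         D_2 = c(1, 1, 1),
                         D_3 = c(11, 3, 1))

result <- drop_constant_cols(df_example)
names(result)
```

With the default na.rm = FALSE, a column such as c(1, NA, 1) counts as having two distinct values and is kept; pass na.rm = TRUE if such columns should be dropped as well.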
You can do:
df1[c(TRUE, lapply(df1[-1], var, na.rm = TRUE) != 0)]
# Item_Name D_1 D_3
# 1 test1 1 11
# 2 test2 0 3
# 3 test3 1 1
where the lapply piece tells you which variables have some variance:
lapply(df1[-1], var, na.rm = TRUE) != 0
# D_1 D_2 D_3
# TRUE FALSE TRUE
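One caveat with the var test: a column with fewer than two non-missing values has a variance of NA, so the comparison with 0 yields NA instead of TRUE or FALSE. A possible guard, sketched with made-up data (df_na, col_var and keep are illustrative names), treats an NA variance as "no variance":

```r
# Example data: D_2 has only one non-missing value, so its variance is NA
df_na <- data.frame(Item_Name = c("a", "b", "c"),
                    D_1 = c(1, 0, 1),
                    D_2 = c(NA, 5, NA))

# Variance of every column except the first (the label column)
col_var <- sapply(df_na[-1], var, na.rm = TRUE)

# Keep a column only when its variance is defined and non-zero
keep <- !is.na(col_var) & col_var != 0

result_na <- df_na[c(TRUE, keep)]
names(result_na)
```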