I want to select or subset variables in a data frame whose column sum is not zero but also keeping other factor variables as well. It should be fairly simple but I cannot figure out how to run the select_if()
function on a subset of variables using dplyr
:
df <- data.frame(
A = c("a", "a", "b", "c", "c", "d"),
B = c(0, 0, 0, 0, 0, 0),
C = c(3, 0, 0, 1, 1, 2),
D = c(0, 3, 2, 1, 4, 5)
)
require(dplyr)
df %>%
select_if(funs(sum(.) > 0))
#Error in Summary.factor(c(1L, 1L, 2L, 3L, 3L, 4L), na.rm = FALSE) :
# ‘sum’ not meaningful for factors
Then I tried to only select B, C, D
and this works, but I won't have variable A
:
df %>%
select(-A) %>%
select_if(funs(sum(.) > 0)) -> df2
df2
# C D
#1 3 0
#2 0 3
#3 0 2
#4 1 1
#5 1 4
#6 2 5
I could simply do cbind(A = df$A, df2)
but since I have a dataset with 3000 rows and 200 columns, I am afraid this could introduce errors (if values sort differently for example).
Trying to subset variables B, C, D
in the sum()
function doesn't work either:
df %>%
select_if(funs(sum(names(.[2:4])) > 0))
#data frame with 0 columns and 6 rows
Try this:
df %>% select_if(~ !is.numeric(.) || sum(.) != 0)
# A C D
# 1 a 3 0
# 2 a 0 3
# 3 b 0 2
# 4 c 1 1
# 5 c 1 4
# 6 d 2 5
The rationale is that for ||
if the left-side is TRUE
, the right-side won't be evaluated.
Note:
select_if
should be a function name or formula (lambda function). the ~
is necessary to tell select_if
that !is.numeric(.) || sum(.) != 0
should be converted to a function.is.factor(.)
should be used if one only wants to keep factor
columns.Edit: a base R solution
cols <- c('B', 'C', 'D')
cols.to.keep <- cols[colSums(df[cols]) != 0]
df[!names(df) %in% cols || names(df) %in% cols.to.keep]
Here is an update for everyone who wants to use the new dplyr 1.0.0
which doesn't have the scoped variants (like select_if
as nicely shown by @mt1022 but deprecated):
df %>%
select(where(is.numeric)) %>%
select(where(~sum(.) != 0))
If you want to compress the two select statements into one, you cannot do this by the element-wise &
but longer form &&
because this produces the required boolean output:
df %>% select(where(~ is.numeric(.x) && sum(.x) !=0 ))
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With