Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Select or subset variables whose column sums are not zero

Tags:

r

dplyr

I want to select or subset variables in a data frame whose column sum is not zero but also keeping other factor variables as well. It should be fairly simple but I cannot figure out how to run the select_if() function on a subset of variables using dplyr:

df <- data.frame(
  A = c("a", "a", "b", "c", "c", "d"),
  B = c(0, 0, 0, 0, 0, 0),
  C = c(3, 0, 0, 1, 1, 2),
  D = c(0, 3, 2, 1, 4, 5)
)

require(dplyr)
df %>% 
  select_if(funs(sum(.) > 0))

#Error in Summary.factor(c(1L, 1L, 2L, 3L, 3L, 4L), na.rm = FALSE) : 
#  ‘sum’ not meaningful for factors

Then I tried to only select B, C, D and this works, but I won't have variable A:

df %>% 
  select(-A) %>% 
  select_if(funs(sum(.) > 0)) -> df2
df2
#  C D
#1 3 0
#2 0 3
#3 0 2
#4 1 1
#5 1 4
#6 2 5

I could simply do cbind(A = df$A, df2) but since I have a dataset with 3000 rows and 200 columns, I am afraid this could introduce errors (if values sort differently for example).

Trying to subset variables B, C, D in the sum() function doesn't work either:

df %>% 
  select_if(funs(sum(names(.[2:4])) > 0))
#data frame with 0 columns and 6 rows
like image 895
Stefan Avatar asked Oct 31 '18 07:10

Stefan


2 Answers

Try this:

df %>% select_if(~ !is.numeric(.) || sum(.) != 0)
#   A C D
# 1 a 3 0
# 2 a 0 3
# 3 b 0 2
# 4 c 1 1
# 5 c 1 4
# 6 d 2 5

The rationale is that for || if the left-side is TRUE, the right-side won't be evaluated.

Note:

  • the second argument for select_if should be a function name or formula (lambda function). the ~ is necessary to tell select_if that !is.numeric(.) || sum(.) != 0 should be converted to a function.
  • As commented below by @zx8754, is.factor(.)should be used if one only wants to keep factor columns.

Edit: a base R solution

cols <- c('B', 'C', 'D')
cols.to.keep <- cols[colSums(df[cols]) != 0]
df[!names(df) %in% cols || names(df) %in% cols.to.keep]
like image 79
mt1022 Avatar answered Nov 20 '22 21:11

mt1022


Here is an update for everyone who wants to use the new dplyr 1.0.0 which doesn't have the scoped variants (like select_if as nicely shown by @mt1022 but deprecated):

df %>% 
  select(where(is.numeric)) %>% 
  select(where(~sum(.) != 0))

If you want to compress the two select statements into one, you cannot do this by the element-wise & but longer form && because this produces the required boolean output:

df %>% select(where(~ is.numeric(.x) && sum(.x) !=0 ))
like image 3
Agile Bean Avatar answered Nov 20 '22 21:11

Agile Bean