Count number of values in row using dplyr

Question

This question should have a simple, elegant solution but I can't figure it out, so here it goes:

Let's say I have the following dataset and I want to count the number of 2s present in each row using dplyr.

set.seed(1)
ID <- LETTERS[1:5]
X1 <- sample(1:5, 5,T)
X2 <- sample(1:5, 5,T)
X3 <- sample(1:5, 5,T)

df <- data.frame(ID,X1,X2,X3)
library(dplyr)

Now, the following works:

df %>%
  rowwise %>%
  mutate(numtwos = sum(c(X1,X2,X3) == 2))

But how do I avoid typing out all of the column names?

I know this is probably easier to do without dplyr, but more generally I want to know how I can use dplyr's mutate with multiple columns without typing out all the column names.

evan.oman · Accepted Answer

Try rowSums:

> set.seed(1)
> ID <- LETTERS[1:5]
> X1 <- sample(1:5, 5,T)
> X2 <- sample(1:5, 5,T)
> X3 <- sample(1:5, 5,T)
> df <- data.frame(ID,X1,X2,X3)
> df
  ID X1 X2 X3
1  A  2  5  2
2  B  2  5  1
3  C  3  4  4
4  D  5  4  2
5  E  2  1  4
> rowSums(df == 2)
[1] 2 1 0 1 1

Alternatively, with dplyr:

> df %>% mutate(numtwos = rowSums(. == 2))
  ID X1 X2 X3 numtwos
1  A  2  5  2       2
2  B  2  5  1       1
3  C  3  4  4       0
4  D  5  4  2       1
5  E  2  1  4       1

Steven Beaupré · Answer

Here's another alternative using purrr:

library(purrr)

df %>%
  by_row(function(x) {
    sum(x[-1] == 2) },
    .to = "numtwos",
    .collate = "cols"
  )

Which gives:

#Source: local data frame [5 x 5]
#
#      ID    X1    X2    X3 numtwos
#  <fctr> <int> <int> <int>   <int>
#1      A     2     5     2       2
#2      B     2     5     1       1
#3      C     3     4     4       0
#4      D     5     4     2       1
#5      E     2     1     4       1

As per mentioned in the NEWS, row based functionals are still maturing in dplyr:

We are still figuring out what belongs in dplyr and what belongs in purrr. Expect much experimentation and many changes with these functions.

Benchmark

We can see how rowwise() and do() compare to purrr::by_row() for this type of problem and how they "perform" against rowSums() and the tidy data way:

largedf <-  df[rep(seq_len(nrow(df)), 10e3), ]

library(microbenchmark)
microbenchmark(
  steven = largedf %>% 
    by_row(function(x) { 
      sum(x[-1] == 2) }, 
      .to = "numtwos", 
      .collate = "cols"),
  psidom = largedf %>% 
    rowwise %>% 
    do(data_frame(numtwos = sum(.[-1] == 2))) %>% 
    cbind(largedf, .),
  gopala = largedf %>% 
    gather(key, value, -ID) %>% 
    group_by(ID) %>% 
    summarise(numtwos = sum(value == 2)) %>% 
    inner_join(largedf, .),
  evan   = largedf %>% 
    mutate(numtwos = rowSums(. == 2)),
  times  = 10L,
  unit   = "relative"
)

Results:

#Unit: relative
#   expr         min          lq        mean      median         uq         max neval cld
# steven 1225.190659 1261.466936 1267.737126 1227.762573 1276.07977 1339.841636    10  b 
# psidom 3677.603240 3759.402212 3726.891458 3678.717170 3728.78828 3777.425492    10   c
# gopala    2.715005    2.684599    2.638425    2.612631    2.59827    2.572972    10 a  
#   evan    1.000000    1.000000    1.000000    1.000000    1.00000    1.000000    10 a

Count number of values in row using dplyr

Tags:

r

dplyr

C_Z_

2 Answers

evan.oman

Steven Beaupré

Recent Activity

Donate For Us

Count number of values in row using dplyr

Tags:

r

dplyr

C_Z_

2 Answers

evan.oman

Steven Beaupré

Related questions

Recent Activity

Donate For Us