
Sum all cells to the right of a column in each row using Dplyr

Tags: r, dplyr, sum, row

I've seen many pages on the general version of this problem, but here I specifically want to sum all values in each row that come after a specific column.

Let's say we have this df:

id    city      identity   q1   q2   q3
0110  detroit   ella       2    4    3
0111  boston    fitz       0    0    0
0112  philly    gerald     3    1    0
0113  new_york  doowop     8    11   2
0114  ontario   wazaaa     NA   11   NA

The dfs I work with don't always have 3 "q" variables; the number varies. Hence, I would like to rowSum every row, but only over the columns that come after the identity column.

Rows with NA are to be ignored.

Eventually I would like to remove the rows that sum to 0 and end up with a df that looks like this:

id    city      identity   q1   q2   q3
0110  detroit   ella       2    4    3
0112  philly    gerald     3    1    0
0113  new_york  doowop     8    11   2

Doing this in dplyr is the preference but not required.

EDIT:

I have added below the data for which this solution does not work; apologies for the confusion.

df <- structure(list(Program = c("3002", "111", "2455", "2929", "NA", 
"NA", NA), Project_ID = c("299", "11", "271", "780", "207", "222", 
NA), Advance_Identifier = c(14, 24, 12, 15, NA, 11, NA), Sequence = c(6, 
4, 4, 5, 2, 3, 79), Item = c("payment", "hero", "prepayment_2", 
"UPS", "period", "prepayment", "yeet"), q1 = c("500", "12", "-1", 
"0", NA, "0", "0"), q2 = c("500", "12", "-1", "0", NA, "0", "1"
), q3 = c("500", "12", "2", "0", NA, "0", "2"), q4 = c("500", 
"13", "0", "0", NA, "0", "3")), row.names = c(NA, -7L), class = c("tbl_df", 
"tbl", "data.frame"))

Johnny Thomas asked Dec 04 '22 18:12


2 Answers

Base R version with zero extra dependencies:

[Edit: I always forget rowSums exists]

> df1$new = rowSums(
    df1[,(1+which(names(df1)=="identity")):ncol(df1),drop=FALSE]
    )


> df1
   id     city identity q1 q2 q3 new
1 110  detroit     ella  2  4  3   9
2 111   boston     fitz  0  0  0   0
3 112   philly   gerald  3  1  0   4
4 113 new_york   doowop  8 11  2  21

If you need to convert chars to numbers, use apply with as.numeric:

df$new = apply(df[,(1+which(names(df)=="Item")):ncol(df),drop=FALSE], 1, function(row){sum(as.numeric(row))})

BUT look out if they are really factors because this will fail, which is why converting things that look like numbers to numbers before you do anything else is a Good Thing.
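As a minimal sketch of the "convert first, then sum" advice, the snippet below uses a cut-down copy of the updated data from the question. The helper name `sum_after` is hypothetical (it is not from either answer), and routing each column through `as.character()` is an assumption made here to guard against factor columns.

```r
# Hypothetical helper: sum every column to the right of `col`, row by row.
sum_after <- function(data, col, na.rm = TRUE) {
  start <- match(col, names(data)) + 1
  stopifnot(!is.na(start), start <= ncol(data))
  cols <- data[, start:ncol(data), drop = FALSE]
  # Go through character first so a factor yields its labels,
  # not its internal integer codes.
  cols[] <- lapply(cols, function(x) as.numeric(as.character(x)))
  rowSums(cols, na.rm = na.rm)
}

# Cut-down copy of the character-typed data from the question's EDIT.
df <- data.frame(
  Item = c("payment", "hero", "prepayment_2", "UPS", "period"),
  q1 = c("500", "12", "-1", "0", NA),
  q2 = c("500", "12", "-1", "0", NA),
  q3 = c("500", "12", "2", "0", NA),
  q4 = c("500", "13", "0", "0", NA),
  stringsAsFactors = FALSE
)

df$new <- sum_after(df, "Item")   # 2000, 49, 0, 0, 0
kept <- df[df$new != 0, ]         # keeps "payment" and "hero"
```

Note that "prepayment_2" sums to 0 only because its values cancel (-1 - 1 + 2 + 0), and the all-NA row sums to 0 because of `na.rm = TRUE`; whether those rows should be dropped depends on what the data means.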

Benchmark

In case you are worried about speed here's a benchmark test of my function against the currently accepted solution:

akrun = function(df1){df1 %>%
   mutate(new = rowSums(select(., ((match('identity', names(.)) + 
           1):ncol(.))), na.rm = TRUE))}

baz = function(df1){rowSums(
    df1[,(1+which(names(df1)=="identity")):ncol(df1),drop=FALSE]
    )}

sample data

df = data.frame(id=sample(100,100), city=sample(LETTERS,100,TRUE), identity=sample(letters,100,TRUE), q1=runif(100), q2=runif(100),q3=runif(100))

Test: note that I remove the new column from the source data frame on each run, otherwise the code keeps adding it. (akrun doesn't modify df in place, but in the benchmark code it can run after baz has already assigned the new column to df.)

> microbenchmark({df$new=NULL;df2 = akrun(df)},{df$new=NULL;df$new=baz(df)})
Unit: microseconds
                                       expr      min       lq       mean    median        uq      max neval
  {     df$new = NULL     df2 = akrun(df) } 1300.682 1328.941 1396.63477 1376.9425 1398.5880 2075.894   100
 {     df$new = NULL     df$new = baz(df) }   63.102   72.721   87.78668   84.3655   86.7005  685.594   100

The tidyverse version takes about 16 times as long as the base R version (median of roughly 1377 vs 84 microseconds).

Spacedman answered Dec 11 '22 15:12


We can use

out <- df1 %>%
   mutate(new = rowSums(select(., ((match('identity', names(.)) + 
           1):ncol(.))), na.rm = TRUE))
out
#    id     city identity q1 q2 q3 new
#1 110  detroit     ella  2  4  3   9
#2 111   boston     fitz  0  0  0   0
#3 112   philly   gerald  3  1  0   4
#4 113 new_york   doowop  8 11  2  21

and then filter out the rows that have 0 in 'new'

out %>%
    filter(new >0)
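The question also asks for rows containing NA to be left out (the "ontario" row is absent from the desired output), whereas `na.rm = TRUE` counts its NAs as 0 and keeps it. A minimal sketch of one way to get the desired output, assuming that reading of the question: leave `na.rm` at `FALSE` so the row sum becomes NA, and rely on `filter()` dropping rows where the condition is NA. The names `q_cols` and `kept` are illustrative only.

```r
library(dplyr)

# The five-row example from the question, including the row with NAs.
df1 <- data.frame(
  id = c("0110", "0111", "0112", "0113", "0114"),
  city = c("detroit", "boston", "philly", "new_york", "ontario"),
  identity = c("ella", "fitz", "gerald", "doowop", "wazaaa"),
  q1 = c(2, 0, 3, 8, NA),
  q2 = c(4, 0, 1, 11, 11),
  q3 = c(3, 0, 0, 2, NA),
  stringsAsFactors = FALSE
)

# Names of all columns to the right of `identity`.
q_cols <- names(df1)[(match("identity", names(df1)) + 1):ncol(df1)]

kept <- df1 %>%
  # na.rm = FALSE: any NA in a row makes that row's sum NA.
  mutate(new = rowSums(select(., all_of(q_cols)), na.rm = FALSE)) %>%
  # filter() keeps only rows where the condition is TRUE,
  # so both the zero-sum row and the NA row are dropped.
  filter(new > 0) %>%
  select(-new)
```

With `na.rm = TRUE` instead, the "ontario" row would sum to 11 and be kept.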

In the OP's updated dataset, the columns are of type character. We can automatically convert them to appropriate types with

df %>%
    # type.convert(as.is = TRUE) %>% # base R
    # or with `readr::type_convert`
    readr::type_convert() %>%
    ... 
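To illustrate the conversion step on its own, here is a small sketch using the base R `type.convert()` on made-up character data in the shape of the OP's update. The toy values and the names `chr_df` and `converted` are assumptions for illustration.

```r
# Toy data: every column is stored as character, as in the OP's update.
chr_df <- data.frame(
  Item = c("payment", "hero", "period"),
  q1 = c("500", "12", NA),
  q2 = c("500", "13", NA),
  stringsAsFactors = FALSE
)

# as.is = TRUE keeps non-numeric columns as character instead of factor.
converted <- type.convert(chr_df, as.is = TRUE)

sapply(converted, class)

# Once the q columns are numeric, rowSums works as in the answer above.
converted$new <- rowSums(
  converted[(match("Item", names(converted)) + 1):ncol(converted)],
  na.rm = TRUE
)
```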

NOTE: The OP asked for a tidyverse option in the title and in the description. It is not a question about efficiency.

Also, rowSums is itself a base R function; here we showed how to use it in a tidyverse chain. I could just as well have written a base R answer using the same function.

If we remove the select, it becomes plain base R, i.e.

df1$new <- rowSums(df1[(match('identity', names(df1)) + 1):ncol(df1)], na.rm = TRUE)

Benchmarks

df = data.frame(id=sample(100,100), city=sample(LETTERS,100,TRUE), 
      identity=sample(letters,100,TRUE), q1=runif(100), q2=runif(100),q3=runif(100))
akrun = function(df1){
 rowSums(df1[(match('identity', names(df1)) + 1):ncol(df1)], na.rm = TRUE)
}



baz = function(df1){rowSums(
    df1[,(1+which(names(df1)=="identity")):ncol(df1),drop=FALSE]
    )}

microbenchmark({df$new=NULL;df2 = akrun(df)},{df$new=NULL;df$new=baz(df)})
#Unit: microseconds
#                                       expr    min     lq     mean  median      uq      max neval
#  {     df$new = NULL     df2 = akrun(df) } 69.926 73.244 112.2078 75.4335 78.7625 3539.921   100
# {     df$new = NULL     df$new = baz(df) } 73.670 77.945 118.3875 80.5045 83.5100 3767.812   100
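The two benchmarked functions are not strictly equivalent: `akrun` passes `na.rm = TRUE` and `baz` does not. A quick sanity check, sketched here with the same kind of random data (the `set.seed` call is added only for reproducibility), shows that they agree on NA-free input and differ once an NA appears:

```r
set.seed(1)
df <- data.frame(id = sample(100, 100), city = sample(LETTERS, 100, TRUE),
                 identity = sample(letters, 100, TRUE),
                 q1 = runif(100), q2 = runif(100), q3 = runif(100))

akrun <- function(df1) {
  rowSums(df1[(match("identity", names(df1)) + 1):ncol(df1)], na.rm = TRUE)
}
baz <- function(df1) {
  rowSums(df1[, (1 + which(names(df1) == "identity")):ncol(df1), drop = FALSE])
}

# On data without NAs both functions return the same sums.
stopifnot(isTRUE(all.equal(akrun(df), baz(df))))

# Introduce one NA: akrun skips it, baz propagates it.
df$q1[1] <- NA
a1 <- akrun(df)[[1]]   # q2[1] + q3[1]
b1 <- baz(df)[[1]]     # NA
```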

data

df1 <- structure(list(id = 110:113, city = c("detroit", "boston", "philly", 
"new_york"), identity = c("ella", "fitz", "gerald", "doowop"), 
    q1 = c(2L, 0L, 3L, 8L), q2 = c(4L, 0L, 1L, 11L), q3 = c(3L, 
    0L, 0L, 2L)), class = "data.frame", row.names = c(NA, -4L
))

akrun answered Dec 11 '22 15:12