So I've seen many pages on the generalized version of this issue but here specifically I would like to sum all values in a row after a specific column.
Let's say we have this df:
id city identity q1 q2 q3
0110 detroit ella 2 4 3
0111 boston fitz 0 0 0
0112 philly gerald 3 1 0
0113 new_york doowop 8 11 2
0114 ontario wazaaa NA 11 NA
Now the df's I work with aren't usually with 3 "q" variables, they vary. Hence, I would like to rowSum every row but only sum the rows that are after the column identity
.
Rows with NA are to be ignored.
Eventually I would like to take the rows which sum to 0 to be removed and end with a df that looks like this:
id city identity q1 q2 q3
0110 detroit ella 2 4 3
0112 philly gerald 3 1 0
0113 new_york doowop 8 11 2
Doing this in dplyr is the preference but not required.
EDIT:
I have added below the data of which this solution is not working for, apologies for the confusion.
df <- structure(list(Program = c("3002", "111", "2455", "2929", "NA",
"NA", NA), Project_ID = c("299", "11", "271", "780", "207", "222",
NA), Advance_Identifier = c(14, 24, 12, 15, NA, 11, NA), Sequence = c(6,
4, 4, 5, 2, 3, 79), Item = c("payment", "hero", "prepayment_2",
"UPS", "period", "prepayment", "yeet"), q1 = c("500", "12", "-1",
"0", NA, "0", "0"), q2 = c("500", "12", "-1", "0", NA, "0", "1"
), q3 = c("500", "12", "2", "0", NA, "0", "2"), q4 = c("500",
"13", "0", "0", NA, "0", "3")), row.names = c(NA, -7L), class = c("tbl_df",
"tbl", "data.frame"))
Base R version with zero extra dependencies:
[Edit: I always forget rowSums
exists]
> df1$new = rowSums(
df1[,(1+which(names(df1)=="identity")):ncol(df1),drop=FALSE]
)
> df1
id city identity q1 q2 q3 new
1 110 detroit ella 2 4 3 9
2 111 boston fitz 0 0 0 0
3 112 philly gerald 3 1 0 4
4 113 new_york doowop 8 11 2 21
If you need to convert chars to numbers, use apply
with as.numeric
:
df$new = apply(df[,(1+which(names(df)=="Item")):ncol(df),drop=FALSE], 1, function(col){sum(as.numeric(col))})
BUT look out if they are really factors because this will fail, which is why converting things that look like numbers to numbers before you do anything else is a Good Thing.
In case you are worried about speed here's a benchmark test of my function against the currently accepted solution:
akrun = function(df1){df1 %>%
mutate(new = rowSums(select(., ((match('identity', names(.)) +
1):ncol(.))), na.rm = TRUE))}
baz = function(df1){rowSums(
df1[,(1+which(names(df1)=="identity")):ncol(df1),drop=FALSE]
)}
sample data
df = data.frame(id=sample(100,100), city=sample(LETTERS,100,TRUE), identity=sample(letters,100,TRUE), q1=runif(100), q2=runif(100),q3=runif(100))
Test - note I remove the new
column from the source data frame each time otherwise the code keeps adding one of those into it (although akrun
doesn't modify df
in place it can get run after baz
has modified it by assigning it the new column in the benchmark code).
> microbenchmark({df$new=NULL;df2 = akrun(df)},{df$new=NULL;df$new=baz(df)})
Unit: microseconds
expr min lq mean
{ df$new = NULL df2 = akrun(df) } 1300.682 1328.941 1396.63477
{ df$new = NULL df$new = baz(df) } 63.102 72.721 87.78668
median uq max neval
1376.9425 1398.5880 2075.894 100
84.3655 86.7005 685.594 100
The tidyverse version takes 16 times as long as the base R version.
We can use
out <- df1 %>%
mutate(new = rowSums(select(., ((match('identity', names(.)) +
1):ncol(.))), na.rm = TRUE))
out
# id city identity q1 q2 q3 new
#1 110 detroit ella 2 4 3 9
#2 111 boston fitz 0 0 0 0
#3 112 philly gerald 3 1 0 4
#4 113 new_york doowop 8 11 2 21
and then filter
out the rows that have 0 in 'new'
out %>%
filter(new >0)
In the OP's updated dataset, the type
of columns are character
. We can automatically convert the type
s to respective types with
df %>%
#type.convert %>% # base R
# or with `readr::type_convert
type_convert %>%
...
NOTE: The OP mentioned in the title and in the description about a tidyverse
option. It is not a question about efficiency.
Also, rowSums
is a base R
option. Here, we showed how to use that in tidyverse
chain. I could have written an answer in base R
way too earlier with the same option.
If we remove the select
, it becomes just a base R
i.e
df1$new < rowSums(df1[(match('identity', names(df1)) + 1):ncol(df1)], na.rm = TRUE)
df = data.frame(id=sample(100,100), city=sample(LETTERS,100,TRUE),
identity=sample(letters,100,TRUE), q1=runif(100), q2=runif(100),q3=runif(100))
akrun = function(df1){
rowSums(df1[(match('identity', names(df1)) + 1):ncol(df1)], na.rm = TRUE)
}
baz = function(df1){rowSums(
df1[,(1+which(names(df1)=="identity")):ncol(df1),drop=FALSE]
)}
microbenchmark({df$new=NULL;df2 = akrun(df)},{df$new=NULL;df$new=baz(df)})
#Unit: microseconds
# expr min lq mean median uq max neval
# { df$new = NULL df2 = akrun(df) } 69.926 73.244 112.2078 75.4335 78.7625 3539.921 100
# { df$new = NULL df$new = baz(df) } 73.670 77.945 118.3875 80.5045 83.5100 3767.812 100
df1 <- structure(list(id = 110:113, city = c("detroit", "boston", "philly",
"new_york"), identity = c("ella", "fitz", "gerald", "doowop"),
q1 = c(2L, 0L, 3L, 8L), q2 = c(4L, 0L, 1L, 11L), q3 = c(3L,
0L, 0L, 2L)), class = "data.frame", row.names = c(NA, -4L
))
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With