R: How to sum values from rows only if the key value is the same and also if the key duplicated in consecutive rows?

Question

I have the following data table of this structure:

+-------------------+
| id  | key | value |
+-----+-----+-------+
| 1   | A   | 1000  |
| 1   | A   | 2000  |
| 1   | B   | 2001  |
| 1   | A   | 2002  |
| 1   | A   | 2004  |
| 2   | B   | 2002  |
| 2   | C   | 2002  |
+-------------------+

My objective is to sum the values by id and key but instead of just grouping by id and key, I only want to sum the values if the pair of id and key are the same for consecutive rows.

The result should be:

+-------------------+
| id  | key | value |
+-----+-----+-------+
| 1   | A   | 3000  |
| 1   | B   | 2001  |
| 1   | A   | 4006  |
| 2   | B   | 2002  |
| 2   | C   | 2002  |
+-------------------+

Is there anyway to achieve this result?

akrun · Accepted Answer

We can use rleid from data.table.

We convert the 'data.frame' to 'data.table'. Create another grouping column 'ind' from the 'key' column. Grouped by 'id' and 'ind', we get the sum of 'value' and get the first element of 'key'. We can assign the 'ind' to NULL as it is not needed in the expected output.

library(data.table)
setDT(df1)[,list(value = sum(value), key=key[1L]),
                    by = .(ind=rleid(key), id)][, ind:=NULL][]
#   id value key
#1:  1  3000   A
#2:  1  2001   B
#3:  1  4006   A
#4:  2  2002   B
#5:  2  2002   C

Or as @Frank suggested, we can use mutliple columns within rleid, use it as grouping variable, get the first element of other variables and sum of 'value', assign the unwanted column to NULL to avoid copies.

setDT(df1)[, list(id=id[1L], key=key[1L], value=sum(value)) ,
        by = .(r=rleid(id, key))][, r:= NULL][]
#   id key value
#1:  1   A  3000
#2:  1   B  2001
#3:  1   A  4006
#4:  2   B  2002
#5:  2   C  2002

Or we can use dplyr. We create the grouping variable 'ind' by comparing the adjacent elements of 'key', and get the sum of 'value' and the first element of 'key' with summarise.

library(dplyr)
df1 %>%
     group_by(ind= cumsum(key!=lag(key, default=TRUE)), id) %>%
     summarise(value=sum(value), key=first(key)) %>% 
     ungroup() %>%
     select(-ind)
#  id value key
#1  1  3000   A
#2  1  2001   B
#3  1  4006   A
#4  2  2002   B
#5  2  2002   C

NOTE: In the dplyr and data.table we can also place the 'key' column as the grouping variable and remove the key=key[1L] or key=first(key)).

Or we transform the dataset by creating the 'ind' column and use aggregate from base R to get the expected output

df1 <- transform(df1, ind = cumsum(c(TRUE,head(key,-1)!=tail(key,-1))))
aggregate(value~., df1, FUN=sum)[-3]
#  id key value
#1  1   A  3000
#2  1   B  2001
#3  1   A  4006
#4  2   B  2002
#5  2   C  2002

data

df1 <- structure(list(id = c(1L, 1L, 1L, 1L, 1L, 2L, 2L), key = c("A", 
"A", "B", "A", "A", "B", "C"), value = c(1000L, 2000L, 2001L, 
2002L, 2004L, 2002L, 2002L)), .Names = c("id", "key", "value"
), class = "data.frame", row.names = c(NA, -7L))

R: How to sum values from rows only if the key value is the same and also if the key duplicated in consecutive rows?

Tags:

r

data.table

Zeke

1 Answers

data

akrun

Recent Activity

Donate For Us

R: How to sum values from rows only if the key value is the same and also if the key duplicated in consecutive rows?

Tags:

r

data.table

Zeke

1 Answers

data

akrun

Related questions

Recent Activity

Donate For Us