I have the following data table of this structure:
+-------------------+
| id | key | value |
+-----+-----+-------+
| 1 | A | 1000 |
| 1 | A | 2000 |
| 1 | B | 2001 |
| 1 | A | 2002 |
| 1 | A | 2004 |
| 2 | B | 2002 |
| 2 | C | 2002 |
+-------------------+
My objective is to sum the values by id and key but instead of just grouping by id and key, I only want to sum the values if the pair of id and key are the same for consecutive rows.
The result should be:
+-------------------+
| id | key | value |
+-----+-----+-------+
| 1 | A | 3000 |
| 1 | B | 2001 |
| 1 | A | 4006 |
| 2 | B | 2002 |
| 2 | C | 2002 |
+-------------------+
Is there anyway to achieve this result?
We can use rleid from data.table.
We convert the 'data.frame' to 'data.table'. Create another grouping column 'ind' from the 'key' column. Grouped by 'id' and 'ind', we get the sum of 'value' and get the first element of 'key'. We can assign the 'ind' to NULL as it is not needed in the expected output.
library(data.table)
setDT(df1)[,list(value = sum(value), key=key[1L]),
by = .(ind=rleid(key), id)][, ind:=NULL][]
# id value key
#1: 1 3000 A
#2: 1 2001 B
#3: 1 4006 A
#4: 2 2002 B
#5: 2 2002 C
Or as @Frank suggested, we can use mutliple columns within rleid, use it as grouping variable, get the first element of other variables and sum of 'value', assign the unwanted column to NULL to avoid copies.
setDT(df1)[, list(id=id[1L], key=key[1L], value=sum(value)) ,
by = .(r=rleid(id, key))][, r:= NULL][]
# id key value
#1: 1 A 3000
#2: 1 B 2001
#3: 1 A 4006
#4: 2 B 2002
#5: 2 C 2002
Or we can use dplyr. We create the grouping variable 'ind' by comparing the adjacent elements of 'key', and get the sum of 'value' and the first element of 'key' with summarise.
library(dplyr)
df1 %>%
group_by(ind= cumsum(key!=lag(key, default=TRUE)), id) %>%
summarise(value=sum(value), key=first(key)) %>%
ungroup() %>%
select(-ind)
# id value key
#1 1 3000 A
#2 1 2001 B
#3 1 4006 A
#4 2 2002 B
#5 2 2002 C
NOTE: In the dplyr and data.table we can also place the 'key' column as the grouping variable and remove the key=key[1L] or key=first(key)).
Or we transform the dataset by creating the 'ind' column and use aggregate from base R to get the expected output
df1 <- transform(df1, ind = cumsum(c(TRUE,head(key,-1)!=tail(key,-1))))
aggregate(value~., df1, FUN=sum)[-3]
# id key value
#1 1 A 3000
#2 1 B 2001
#3 1 A 4006
#4 2 B 2002
#5 2 C 2002
df1 <- structure(list(id = c(1L, 1L, 1L, 1L, 1L, 2L, 2L), key = c("A",
"A", "B", "A", "A", "B", "C"), value = c(1000L, 2000L, 2001L,
2002L, 2004L, 2002L, 2002L)), .Names = c("id", "key", "value"
), class = "data.frame", row.names = c(NA, -7L))
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With