I have the following data table of this structure:
+-------------------+
| id | key | value |
+-----+-----+-------+
| 1 | A | 1000 |
| 1 | A | 2000 |
| 1 | B | 2001 |
| 1 | A | 2002 |
| 1 | A | 2004 |
| 2 | B | 2002 |
| 2 | C | 2002 |
+-------------------+
My objective is to sum the values by id and key but instead of just grouping by id and key, I only want to sum the values if the pair of id and key are the same for consecutive rows.
The result should be:
+-------------------+
| id | key | value |
+-----+-----+-------+
| 1 | A | 3000 |
| 1 | B | 2001 |
| 1 | A | 4006 |
| 2 | B | 2002 |
| 2 | C | 2002 |
+-------------------+
Is there anyway to achieve this result?
We can use rleid
from data.table
.
We convert the 'data.frame' to 'data.table'. Create another grouping column 'ind' from the 'key' column. Grouped by 'id' and 'ind', we get the sum
of 'value' and get the first element of 'key'. We can assign the 'ind' to NULL as it is not needed in the expected output.
library(data.table)
setDT(df1)[,list(value = sum(value), key=key[1L]),
by = .(ind=rleid(key), id)][, ind:=NULL][]
# id value key
#1: 1 3000 A
#2: 1 2001 B
#3: 1 4006 A
#4: 2 2002 B
#5: 2 2002 C
Or as @Frank suggested, we can use mutliple columns within rleid
, use it as grouping variable, get the first element of other variables and sum
of 'value', assign the unwanted column to NULL
to avoid copies.
setDT(df1)[, list(id=id[1L], key=key[1L], value=sum(value)) ,
by = .(r=rleid(id, key))][, r:= NULL][]
# id key value
#1: 1 A 3000
#2: 1 B 2001
#3: 1 A 4006
#4: 2 B 2002
#5: 2 C 2002
Or we can use dplyr
. We create the grouping variable 'ind' by comparing the adjacent elements of 'key', and get the sum
of 'value' and the first
element of 'key' with summarise
.
library(dplyr)
df1 %>%
group_by(ind= cumsum(key!=lag(key, default=TRUE)), id) %>%
summarise(value=sum(value), key=first(key)) %>%
ungroup() %>%
select(-ind)
# id value key
#1 1 3000 A
#2 1 2001 B
#3 1 4006 A
#4 2 2002 B
#5 2 2002 C
NOTE: In the dplyr
and data.table
we can also place the 'key' column as the grouping variable and remove the key=key[1L]
or key=first(key))
.
Or we transform
the dataset by creating the 'ind' column and use aggregate
from base R
to get the expected output
df1 <- transform(df1, ind = cumsum(c(TRUE,head(key,-1)!=tail(key,-1))))
aggregate(value~., df1, FUN=sum)[-3]
# id key value
#1 1 A 3000
#2 1 B 2001
#3 1 A 4006
#4 2 B 2002
#5 2 C 2002
df1 <- structure(list(id = c(1L, 1L, 1L, 1L, 1L, 2L, 2L), key = c("A",
"A", "B", "A", "A", "B", "C"), value = c(1000L, 2000L, 2001L,
2002L, 2004L, 2002L, 2002L)), .Names = c("id", "key", "value"
), class = "data.frame", row.names = c(NA, -7L))
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With