I am trying to use the group-by feature of the data.table package in R.
library(data.table)
start <- as.Date('2014-1-1')
end <- as.Date('2014-1-6')
time.span <- seq(start, end, "days")
a <- data.table(date = time.span, value=c(1,2,3,4,5,6), group=c('a','a','b','b','a','b'))
date value group
1 2014-01-01 1 a
2 2014-01-02 2 a
3 2014-01-03 3 b
4 2014-01-04 4 b
5 2014-01-05 5 a
6 2014-01-06 6 b
a[,mean(value),by=group]
> group V1
1: a 2.6667
2: b 4.3333
This works fine.
Since I am working with dates, it can happen that a particular date belongs to not just one but two groups.
a <- data.table(date = time.span, value=c(1,2,3,4,5,6), group=list('a',c('a','b'),'b','b','a','b'))
date value group
1 2014-01-01 1 a
2 2014-01-02 2 c("a", "b")
3 2014-01-03 3 b
4 2014-01-04 4 b
5 2014-01-05 5 a
6 2014-01-06 6 b
a[,mean(value),by=group]
> Error in `[.data.table`(a, , mean(value), by = group) :
The items in the 'by' or 'keyby' list are length (1,2,1,1,1,1). Each must be same length as rows in x or number of rows returned by i (6).
I would like the date that belongs to both groups to be included when calculating the mean of group a as well as the mean of group b.
Expected results:
mean a: 2.6667
mean b: 3.75
Is that possible with the data.table package?
Update
Thanks to akrun, my initial issue is solved. After "splitting" the data.table and, in my case, calculating different factors (based on the groups), I need the data.table back in its "original" form, with unique rows based on the date. My solution so far:
a <- data.table(date = time.span, value=c(1,2,3,4,5,6), group=list('a',c('a','b'),'b','b','a','b'))
b <- a[rep(1:nrow(a), lengths(group))][, group:=unlist(a$group)]
date value group
1 2014-01-01 1 a
2 2014-01-02 2 a
3 2014-01-02 2 b
4 2014-01-03 3 b
5 2014-01-04 4 b
6 2014-01-05 5 a
7 2014-01-06 6 b
# create a new column holding the mean of value per group
b[,factor := mean(value), by=group]
# create a new data.table c with one row per date (no duplicates); if a date belongs to groups a and b, its factor is the product of both group factors
c <- b[,.(value = unique(value), group = list(group), factor = prod(factor)),by=date]
date value group factor
01/01/14 1 a 2.666666667
02/01/14 2 c("a", "b") 10
03/01/14 3 b 3.75
04/01/14 4 b 3.75
05/01/14 5 a 2.666666667
06/01/14 6 b 3.75
I guess this is not the ideal way to do it, but it works. Any suggestions on how I could do it better?
Alternative solution (really slow!):
d <- a[rep(1:nrow(a), lengths(group))][,group:=unlist(a$group)][, mean(value), by = group]
for(i in 1:NROW(a)){
y1 <- 1
for(j in a[i,group][[1]]){
y1 <- y1 * d[group==j, V1]
}
a[i, factor := y1]
}
My fastest solution so far:
# split rows that have more than one group
b <- a[rep(1:nrow(a), lengths(group))][, group:=unlist(a$group)]
# calculate mean of different groups
b <- b[,factor := mean(value), by=group]
# only keep date + factor columns
b <- b[,.(date, factor)]
# summarise rows by date
b <- b[,lapply(.SD,prod), by=date]
# add summarised factor column to initial data.table
c <- merge(a,b,by='date')
Is there any chance to make it faster?
One option would be to group by the row sequence: we unlist() the list column ('group'), paste the list elements together with toString(), use cSplit() from splitstackshape with direction = 'long' to reshape the result into 'long' format, and then get the mean of the 'value' column using 'grp' as the grouping variable.
library(data.table)
library(splitstackshape)
a[, grp:= toString(unlist(group)), 1:nrow(a)]
cSplit(a, 'grp', ', ', 'long')[, mean(value), grp]
# grp V1
#1: a 2.666667
#2: b 3.750000
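To see why ', ' is the separator passed to cSplit(), it may help to look at what toString() does to each list entry. The following is only a small base R sketch; the three-element 'group' list is a hypothetical stand-in for the real column, and strsplit() is used here in place of cSplit() purely for illustration.

```r
# Hypothetical stand-in for the 'group' list column
group <- list("a", c("a", "b"), "b")

# toString() joins the elements of each entry with ", "
grp <- vapply(group, toString, character(1))

# Splitting on the same ", " separator recovers the individual groups
parts <- strsplit(grp, ", ", fixed = TRUE)

print(grp)
print(parts)
```

Because the entries are joined with a comma followed by a space, the split has to use that exact two-character separator; splitting on ',' alone would leave a leading space on the second group name.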
I just realized that another option from splitstackshape would be listCol_l(), which unlists a list column into long form. As the output is a data.table, we can use the data.table methods to calculate the mean, which makes this approach much more compact.
listCol_l(a, 'group')[, mean(value), group_ul]
# group_ul V1
#1: a 2.666667
#2: b 3.750000
Another option, without using splitstackshape, would be to replicate the rows of the dataset by the length of each list element. lengths() is a convenient wrapper for sapply(group, length) and is much faster. Then we replace the 'group' column by unlisting the original 'group' column from the 'a' dataset and get the mean of 'value', grouped by 'group'.
a[rep(1:nrow(a), lengths(group))][,
group:=unlist(a$group)][, mean(value), by = group]
# group V1
#1: a 2.666667
#2: b 3.750000
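For the follow-up in the question (one factor per date, multiplied across groups), the same long form can be folded back by date. This is a sketch rather than a benchmarked solution: it assumes the dates in 'a' are unique, keeps the column name 'factor' from the question, and swaps the merge() step for an update join, which is a substitution made here and not part of the original code. The names 'idx', 'long' and 'per_date' are made up for illustration.

```r
library(data.table)

time.span <- seq(as.Date("2014-01-01"), as.Date("2014-01-06"), by = "days")
a <- data.table(date  = time.span,
                value = c(1, 2, 3, 4, 5, 6),
                group = list("a", c("a", "b"), "b", "b", "a", "b"))

# Row i is repeated once per element of group[[i]]: 1 2 2 3 4 5 6
idx <- rep(seq_len(nrow(a)), lengths(a$group))

# Long form: one row per (date, group) pair
long <- a[idx][, group := unlist(a$group)]

# Mean of 'value' per group, stored on every row of that group
long[, factor := mean(value), by = group]

# Fold back to one row per date; a date in several groups gets the product
per_date <- long[, .(factor = prod(factor)), by = date]

# Update join: add 'factor' to the original table by reference
# (assumes 'date' is unique in 'a')
a[per_date, on = "date", factor := i.factor]

print(a)
```

The update join writes the new column into 'a' directly, so the list column 'group' is left untouched and no second copy of the table is created. Whether this is actually faster than merge() on real data would need to be measured.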