Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Splitting irregular time series into regular monthly averages - R

In order to establish seasonal effects on energy use, I need to align the energy use information that I have from a billing database with monthly temperatures.

I'm working with a billing dataset that has bills of varying lengths and start and end dates, and I'd like to obtain the monthly average for each account within each month. For example, I have a billing database that has the following characteristics:

   acct amount      begin        end days
1  2242  11349 2009-10-06 2009-11-04   29
2  2242  12252 2009-11-04 2009-12-04   30
3  2242  21774 2009-12-04 2010-01-08   35
4  2242  18293 2010-01-08 2010-02-05   28
5  2243  27217 2009-10-06 2009-11-04   29
6  2243    117 2009-11-04 2009-12-04   30
7  2243  14543 2009-12-04 2010-01-08   35

I would like to figure out how to coerce these somewhat irregular time series (for each account) to get the average amount per day within each month that is spanned within each bill, such that:

   acct amount      begin        end days avgamtpday
1  2242  11349 2009-10-01 2009-10-31   31          X
2  2242  12252 2009-11-01 2009-11-30   30          X
3  2242  21774 2009-12-01 2010-12-31   31          X
4  2242  18293 2010-01-01 2010-01-31   31          X
4  2242  18293 2010-02-01 2010-02-28   28          X
5  2243  27217 2009-10-01 2009-10-31   31          X
6  2243    117 2009-11-01 2009-11-30   30          X
7  2243  14543 2009-12-01 2009-12-31   30          X
7  2243  14543 2010-01-01 2010-01-31   31          X

I'm fairly agnostic to whichever tool can do this, since I only have to do this once.

An additional wrinkle is the table is about 150,000 rows long, which is not really very big by most standards, but big enough to make a loop solution in R difficult. I've investigated using the zoo, xts, and tempdisagg packages in R. I started writing a really ugly loop that would split each bill, then create one row for each month within an existing bill, and then tapply() to summarize by accts and months, but honestly, couldn't see how to do it efficiently.

In MySQL, I've tried this:

create or replace view v3 as select 1 n union all select 1 union all select 1;
create or replace view v as select 1 n from v3 a, v3 b union all select 1;
set @n = 0;
drop table if exists calendar; create table calendar(dt date primary key);
insert into calendar
select cast('2008-1-1' + interval @n:=@n+1 day as date) as dt from v a, v b, v c, v d, v e, v;

select acct, amount, begin, end, billAmtPerDay, sum(billAmtPerDay), MonthAmt, count() Days, sum(billAmtPerDay)/count() AverageAmtPerDay, year(dt), month(dt) FROM ( select *, amount/days billAmtPerDay from bills b inner join calendar c on dt between begin and end and begin <> dt) x group by acct, amount, begin, end, billAmtPerDay, year(dt), month(dt);

But for reasons I don't understand, my server doesn't like this table, and gets hung up on the inner join, even when I stage the different calculations. I'm investigating if there are any temporary memory limits on it.

Thanks!

like image 584
bikeclub Avatar asked Sep 24 '12 01:09

bikeclub


2 Answers

Here's a start using data.table :

billdata <- read.table(text=" acct amount begin end days
1 2242 11349 2009-10-06 2009-11-04 29
2 2242 12252 2009-11-04 2009-12-04 30
3 2242 21774 2009-12-04 2010-01-08 35
4 2242 18293 2010-01-08 2010-02-05 28
5 2243 27217 2009-10-06 2009-11-04 29
6 2243 117 2009-11-04 2009-12-04 30
7 2243 14543 2009-12-04 2010-01-08 35", sep=" ", header=TRUE, row.names=1)

require(data.table)
DT = as.data.table(billdata)

First, change type of columns begin and end to dates. Unlike data.frame, this doesn't copy the entire dataset.

DT[,begin:=as.Date(begin)]
DT[,end:=as.Date(end)]

Then find the time span, find the prevailing bill for each day, and aggregate.

alldays = DT[,seq(min(begin),max(end),by="day")]

setkey(DT, acct, begin)

DT[CJ(unique(acct),alldays),
   mean(amount/days,na.rm=TRUE),
   by=list(acct,month=format(begin,"%Y-%m")), roll=TRUE]

    acct   month        V1
 1: 2242 2009-10 391.34483
 2: 2242 2009-11 406.69448
 3: 2242 2009-12 601.43226
 4: 2242 2010-01 646.27465
 5: 2242 2010-02 653.32143
 6: 2243 2009-10 938.51724
 7: 2243 2009-11  97.36172
 8: 2243 2009-12 375.68065
 9: 2243 2010-01 415.51429
10: 2243 2010-02 415.51429

I think you'll find the prevailing join logic quite cumbersome in SQL, and slower.

I say it's a hint because it's not quite correct. Notice row 10 is repeated because account 2243 doesn't stretch into 2010-02 unlike account 2242. To finish it off you could rbind in the last row for each account and use rolltolast instead of roll. Or perhaps create alldays by account rather than across all accounts.

See if speed is acceptable on the above, and we can go from there.

It's likely you will hit a bug in 1.8.2 that has been fixed in 1.8.3. I'm using v1.8.3.

"Internal" error message when combining join containing missing groups and group by is fixed, #2162. For example : X[Y,.N,by=NonJoinColumn] where Y contains some rows that don't match to X. This bug could also result in a seg fault.

Let me know and we can either work around, or upgrade to 1.8.3 from R-Forge.

Btw, nice example data. That made it quicker to answer.


Here's the full answer alluded to above. It's a bit tricky I have to admit, as it combines together several features of data.table. This should work in 1.8.2 as it happens, but I've only tested in 1.8.3.

DT[ setkey(DT[,seq(begin[1],last(end),by="day"),by=acct]),
    mean(amount/days,na.rm=TRUE),
    by=list(acct,month=format(begin,"%Y-%m")), roll=TRUE]

   acct   month        V1
1: 2242 2009-10 391.34483
2: 2242 2009-11 406.69448
3: 2242 2009-12 601.43226
4: 2242 2010-01 646.27465
5: 2242 2010-02 653.32143
6: 2243 2009-10 938.51724
7: 2243 2009-11  97.36172
8: 2243 2009-12 375.68065
9: 2243 2010-01 415.51429
like image 138
Matt Dowle Avatar answered Sep 27 '22 15:09

Matt Dowle


Here is one way to do it:

billdata <- read.table(text=" acct amount begin end days
1 2242 11349 2009-10-06 2009-11-04 29
2 2242 12252 2009-11-04 2009-12-04 30
3 2242 21774 2009-12-04 2010-01-08 35
4 2242 18293 2010-01-08 2010-02-05 28
5 2243 27217 2009-10-06 2009-11-04 29
6 2243 117 2009-11-04 2009-12-04 30
7 2243 14543 2009-12-04 2010-01-08 35", sep=" ", header=TRUE, row.names=1)

#First, declare your columns "begin" and "end" as dates:
strptime(billdata$begin, format="%Y-%m-%d") -> billdata$begin
strptime(billdata$end, format="%Y-%m-%d") -> billdata$end

#Then create a column with the amount per day on the billing period:
billdata$avg_on_period<-billdata$amount/billdata$days

#Then split it into days:
temp <- data.frame(acct=c(),month=c(),day=c(), avg=c())
for(i in 1:nrow(billdata)){
    X <- billdata[i,]
    seq(X$begin,X$end,by="day") -> list_day
    rbind(temp, data.frame(acct=rep(X$acct,length(list_day)), 
        month=format(list_day, "%Y-%m"), day=format(list_day, "%d"), 
        avg=rep(X$avg_on_period, length(list_day)))) -> temp
    }

# And finally merge the different days of the months together:
output<-aggregate(temp$avg, by=list(temp$month,temp$acct), FUN=mean)

colnames(output) <- c("Month","Account","Average per day")

output
    Month Account Average per day
1 2009-10    2242       391.34483
2 2009-11    2242       406.69448
3 2009-12    2242       595.40000
4 2010-01    2242       645.51964
5 2010-02    2242       653.32143
6 2009-10    2243       938.51724
7 2009-11    2243        97.36172
8 2009-12    2243       364.06250
9 2010-01    2243       415.51429
like image 30
plannapus Avatar answered Sep 27 '22 15:09

plannapus