Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Animals in the zoo: can we aggregate a daily time series of factors and flag activity by ID?

Suppose there is a daily time series of animal activity in a zoo over many years. A subset of a very large dataset might look like this:

library(data.table)
type <- c(rep('giraffe',90),rep('monkey',90),rep('anteater',90))
status <- as.factor(c(rep('display',31),rep('caged',28),rep('display',31),
rep('caged',25), rep('display',35),rep('caged',30),rep('caged',10),
rep('display',10),rep('caged',10),rep('display',60)))
date <- rep(seq.Date( as.Date("2001-01-01"), as.Date("2001-03-31"), "day" ),3)

Where 'type' is the animal type and 'status' is an indicator of what the animal is doing that day, for example, caged or on display.

animals <-  data.table(type,status,date);animals
         type  status       date
  1:  giraffe display 2001-01-01
  2:  giraffe display 2001-01-02
  3:  giraffe display 2001-01-03
  4:  giraffe display 2001-01-04
  5:  giraffe display 2001-01-05
 ---                            
266: anteater display 2001-03-27
267: anteater display 2001-03-28
268: anteater display 2001-03-29
269: anteater display 2001-03-30
270: anteater display 2001-03-31

Suppose we want to aggregate this to a monthly series that lists the animals with information on their status for the entire month. In the new series, "status" reflects the status of the animal at the first of the month. "fullmonth" is a binary variable (1=TRUE,0=FALSE) that indicates whether this status held for the entire month and "anydisp" is a binary variable (1=TRUE, 0= FALSE) that indicates whether the animal was on display at any time during the month (>= 1 day). So, because the giraffe was on display for the full months of Jan. and Mar. but caged during Feb. it gets marked accordingly.

date <- rep(seq.Date( as.Date("2001-01-01"), as.Date("2001-03-31"),"month"),3)
type <- c(rep('giraffe',3),rep('monkey',3),rep('anteater',3))
status <- as.factor(c('display','caged','display','caged','display','caged',
'caged','display','display'))
fullmonth <- c(1,1,1,0,1,0,0,1,1)
anydisp <- c(1,0,1,1,1,1,1,1,1)

animals2 <- data.table(date,type,status,fullmonth,anydisp);animals2
     date     type  status fullmonth anydisp
2001-01-01  giraffe display         1   1
2001-02-01  giraffe   caged         1   0
2001-03-01  giraffe display         1   1
2001-01-01   monkey   caged         0   1
2001-02-01   monkey display         1   1
2001-03-01   monkey   caged         0   1
2001-01-01 anteater   caged         0   1
2001-02-01 anteater display         1   1
2001-03-01 anteater display         1   1

I thought zoo might be the way to go but after playing around I found it doesn't handle non-numeric values very well and even if I assign arbitrary values to the qualitative component (status) it's not clear how it will solve the problem.

##aggregate function with zoo? 
library(zoo)
animals$activity <- as.numeric(ifelse(status=='display',1,0))
animals2 <- subset(animals, select=c(date,activity))
datas <- zoo(animals2)
monthlyzoo <- aggregate(datas,as.yearmon,sum)
Error in Summary.factor(1L, na.rm = FALSE) : 
  sum not meaningful for factors

Is anyone aware of a solution using sqldf or data.table ?

Update

Would like to add a new requirement that the date shown be the first of the month even if the data starts later in the month. For example, this data set illustrates such a situation:

animals2 <- animals[30:270,];head(animals2)

setkey(animals2, "type", "date")

oo <- animals2[, list(date=date[1], status = status[1],
                      fullmonth = 1 * all(status == status[1]),
                      anydisplay = any(status == "display") * 1 ),
               by = list(month(date), type)][, month := NULL]
oo

      type       date  status fullmonth anydisplay
1: anteater 2001-01-30   caged         0          1
2: anteater 2001-02-01 display         1          1
3: anteater 2001-03-01 display         1          1
4:  giraffe 2001-01-01 display         1          1
5:  giraffe 2001-02-01   caged         1          0
6:  giraffe 2001-03-01 display         1          1
7:   monkey 2001-01-01   caged         0          1 
8:   monkey 2001-02-01 display         1          1
9:   monkey 2001-03-01 display         0          1

sqldf("select 
    min(date) date, 
    type,
    status, 
    max(status) = min(status) fullmonth,
    sum(status = 'display') > 0 anydisp
from animals2
group by type, strftime('%Y %m', date * 3600 * 24, 'unixepoch')
order by type, date")

        date     type  status fullmonth anydisp
1 2001-01-30 anteater   caged         0       1
2 2001-02-01 anteater display         1       1
3 2001-03-01 anteater display         1       1
4 2001-01-01  giraffe display         1       1
5 2001-02-01  giraffe   caged         1       0
6 2001-03-01  giraffe display         1       1
7 2001-01-01   monkey   caged         0       1
8 2001-02-01   monkey display         1       1
9 2001-03-01   monkey   caged         0       1

This can be accommodated by post processing any of the solution to revise the date:

dateswitch <- paste(year(animals2$date),month(animals2$date),1,sep='/')
dateswitch <- as.Date(dateswitch, "%Y/%m/%d")
animals2$date <- as.Date(dateswitch)
like image 935
hubert_farnsworth Avatar asked May 12 '13 06:05

hubert_farnsworth


2 Answers

Something like this?

setkey(animals, "type", "date")
oo <- animals[, list(date=date[1], status = status[1], 
                     fullmonth = 1 * all(status == status[1]), 
                     anydisplay = any(status == "display") * 1), 
by = list(month(date), type)][, month := NULL]
#        type       date  status fullmonth anydisplay
# 1: anteater 2001-01-01   caged         0          1
# 2: anteater 2001-02-01 display         1          1
# 3: anteater 2001-03-01 display         1          1
# 4:  giraffe 2001-01-01 display         1          1
# 5:  giraffe 2001-02-01   caged         1          0
# 6:  giraffe 2001-03-01 display         1          1
# 7:   monkey 2001-01-01   caged         0          1
# 8:   monkey 2001-02-01 display         1          1
# 9:   monkey 2001-03-01 display         0          1
like image 176
Arun Avatar answered Sep 22 '22 03:09

Arun


Here is an sqldf solution:

library(sqldf)

# define input data.frame where type, status and date variables are defined in question
animals <-  data.frame(type,status,date)

sqldf("select 
    min(date) date, 
    type,
    status, 
    max(status) = min(status) fullmonth,
    sum(status = 'display') > 0 anydisp
from animals
group by type, strftime('%Y %m', date * 3600 * 24, 'unixepoch')
order by type, date")

The output of this command with the data shown is:

        date     type  status fullmonth anydisp
1 2001-01-01 anteater   caged         0       1
2 2001-02-01 anteater display         1       1
3 2001-03-01 anteater display         1       1
4 2001-01-01  giraffe display         1       1
5 2001-02-01  giraffe   caged         1       0
6 2001-03-01  giraffe display         1       1
7 2001-01-01   monkey   caged         0       1
8 2001-02-01   monkey display         1       1
9 2001-03-01   monkey   caged         0       1

ADDED: The poster later added to the question an additional requirement to show the date as the first of the month even if the data did not start until later in that month. If DF is the result of the sqldf statement above then transform it like this:

library(zoo)
transform(DF, date = as.Date(as.yearmon(date)))

or it might be preferable to eliminate the day part (since it may be regarded as misleading if there is no data for that date anyways) and just give the year and month using "yearmon" class:

library(zoo)
transform(DF, date = as.yearmon(date))
like image 44
G. Grothendieck Avatar answered Sep 26 '22 03:09

G. Grothendieck