Suppose there is a daily time series of animal activity in a zoo over many years. A subset of a very large dataset might look like this:
library(data.table)
type <- c(rep('giraffe',90),rep('monkey',90),rep('anteater',90))
status <- as.factor(c(rep('display',31),rep('caged',28),rep('display',31),
rep('caged',25), rep('display',35),rep('caged',30),rep('caged',10),
rep('display',10),rep('caged',10),rep('display',60)))
date <- rep(seq.Date( as.Date("2001-01-01"), as.Date("2001-03-31"), "day" ),3)
Where 'type' is the animal type and 'status' is an indicator of what the animal is doing that day, for example, caged or on display.
animals <- data.table(type,status,date);animals
type status date
1: giraffe display 2001-01-01
2: giraffe display 2001-01-02
3: giraffe display 2001-01-03
4: giraffe display 2001-01-04
5: giraffe display 2001-01-05
---
266: anteater display 2001-03-27
267: anteater display 2001-03-28
268: anteater display 2001-03-29
269: anteater display 2001-03-30
270: anteater display 2001-03-31
Suppose we want to aggregate this to a monthly series that lists the animals with information on their status for the entire month. In the new series, "status" reflects the status of the animal at the first of the month. "fullmonth" is a binary variable (1=TRUE,0=FALSE) that indicates whether this status held for the entire month and "anydisp" is a binary variable (1=TRUE, 0= FALSE) that indicates whether the animal was on display at any time during the month (>= 1 day). So, because the giraffe was on display for the full months of Jan. and Mar. but caged during Feb. it gets marked accordingly.
date <- rep(seq.Date( as.Date("2001-01-01"), as.Date("2001-03-31"),"month"),3)
type <- c(rep('giraffe',3),rep('monkey',3),rep('anteater',3))
status <- as.factor(c('display','caged','display','caged','display','caged',
'caged','display','display'))
fullmonth <- c(1,1,1,0,1,0,0,1,1)
anydisp <- c(1,0,1,1,1,1,1,1,1)
animals2 <- data.table(date,type,status,fullmonth,anydisp);animals2
date type status fullmonth anydisp
2001-01-01 giraffe display 1 1
2001-02-01 giraffe caged 1 0
2001-03-01 giraffe display 1 1
2001-01-01 monkey caged 0 1
2001-02-01 monkey display 1 1
2001-03-01 monkey caged 0 1
2001-01-01 anteater caged 0 1
2001-02-01 anteater display 1 1
2001-03-01 anteater display 1 1
I thought zoo
might be the way to go but after playing around I found it doesn't handle non-numeric values very well and even if I assign arbitrary values to the qualitative component (status) it's not clear how it will solve the problem.
##aggregate function with zoo?
library(zoo)
animals$activity <- as.numeric(ifelse(status=='display',1,0))
animals2 <- subset(animals, select=c(date,activity))
datas <- zoo(animals2)
monthlyzoo <- aggregate(datas,as.yearmon,sum)
Error in Summary.factor(1L, na.rm = FALSE) :
sum not meaningful for factors
Is anyone aware of a solution using sqldf
or data.table
?
Would like to add a new requirement that the date shown be the first of the month even if the data starts later in the month. For example, this data set illustrates such a situation:
animals2 <- animals[30:270,];head(animals2)
setkey(animals2, "type", "date")
oo <- animals2[, list(date=date[1], status = status[1],
fullmonth = 1 * all(status == status[1]),
anydisplay = any(status == "display") * 1 ),
by = list(month(date), type)][, month := NULL]
oo
type date status fullmonth anydisplay
1: anteater 2001-01-30 caged 0 1
2: anteater 2001-02-01 display 1 1
3: anteater 2001-03-01 display 1 1
4: giraffe 2001-01-01 display 1 1
5: giraffe 2001-02-01 caged 1 0
6: giraffe 2001-03-01 display 1 1
7: monkey 2001-01-01 caged 0 1
8: monkey 2001-02-01 display 1 1
9: monkey 2001-03-01 display 0 1
sqldf("select
min(date) date,
type,
status,
max(status) = min(status) fullmonth,
sum(status = 'display') > 0 anydisp
from animals2
group by type, strftime('%Y %m', date * 3600 * 24, 'unixepoch')
order by type, date")
date type status fullmonth anydisp
1 2001-01-30 anteater caged 0 1
2 2001-02-01 anteater display 1 1
3 2001-03-01 anteater display 1 1
4 2001-01-01 giraffe display 1 1
5 2001-02-01 giraffe caged 1 0
6 2001-03-01 giraffe display 1 1
7 2001-01-01 monkey caged 0 1
8 2001-02-01 monkey display 1 1
9 2001-03-01 monkey caged 0 1
This can be accommodated by post processing any of the solution to revise the date:
dateswitch <- paste(year(animals2$date),month(animals2$date),1,sep='/')
dateswitch <- as.Date(dateswitch, "%Y/%m/%d")
animals2$date <- as.Date(dateswitch)
Something like this?
setkey(animals, "type", "date")
oo <- animals[, list(date=date[1], status = status[1],
fullmonth = 1 * all(status == status[1]),
anydisplay = any(status == "display") * 1),
by = list(month(date), type)][, month := NULL]
# type date status fullmonth anydisplay
# 1: anteater 2001-01-01 caged 0 1
# 2: anteater 2001-02-01 display 1 1
# 3: anteater 2001-03-01 display 1 1
# 4: giraffe 2001-01-01 display 1 1
# 5: giraffe 2001-02-01 caged 1 0
# 6: giraffe 2001-03-01 display 1 1
# 7: monkey 2001-01-01 caged 0 1
# 8: monkey 2001-02-01 display 1 1
# 9: monkey 2001-03-01 display 0 1
Here is an sqldf solution:
library(sqldf)
# define input data.frame where type, status and date variables are defined in question
animals <- data.frame(type,status,date)
sqldf("select
min(date) date,
type,
status,
max(status) = min(status) fullmonth,
sum(status = 'display') > 0 anydisp
from animals
group by type, strftime('%Y %m', date * 3600 * 24, 'unixepoch')
order by type, date")
The output of this command with the data shown is:
date type status fullmonth anydisp
1 2001-01-01 anteater caged 0 1
2 2001-02-01 anteater display 1 1
3 2001-03-01 anteater display 1 1
4 2001-01-01 giraffe display 1 1
5 2001-02-01 giraffe caged 1 0
6 2001-03-01 giraffe display 1 1
7 2001-01-01 monkey caged 0 1
8 2001-02-01 monkey display 1 1
9 2001-03-01 monkey caged 0 1
ADDED: The poster later added to the question an additional requirement to show the date as the first of the month even if the data did not start until later in that month. If DF
is the result of the sqldf statement above then transform it like this:
library(zoo)
transform(DF, date = as.Date(as.yearmon(date)))
or it might be preferable to eliminate the day part (since it may be regarded as misleading if there is no data for that date anyways) and just give the year and month using "yearmon"
class:
library(zoo)
transform(DF, date = as.yearmon(date))
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With