Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Using dcast.data.table with date values and aggregation

Trying to figure this out. Suppose you have a data.table:

dt <- data.table (person=c('bob', 'bob', 'bob'), 
                  door=c('front door', 'front door', 'front door'),
                  type=c('timeIn', 'timeIn', 'timeOut'),
                  time=c(
as.POSIXct('2016 12 02 06 05 01', format = '%Y %m %d %H %M %S'),
as.POSIXct('2016 12 02 06 05 02', format = '%Y %m %d %H %M %S'),
as.POSIXct('2016 12 02 06 05 03', format = '%Y %m %d %H %M %S')                     )
)

I want to pivot it to look like this

person        door        timeIn             timeOut

bob           front door  min(<date/time>) max(<date/time>)

I can't seem to get the right syntax for dcast.data.table. I tried

dcast.data.table(
  dt, person + door ~ type, 
  value.var = 'time', 
  fun.aggregate = function(x) ifelse(type == 'timeIn', min(x), max(x))
)

which throws an error:

Aggregating function(s) should take vector inputs and return a single value (length=1).

I also tried:

 dcast.data.table(dt, person + door ~ type, value.var = 'time')

But the result throws away my dates

   person       door timeIn timeOut
1:    bob front door      2       1

Any suggestions would be appreciated. TIA

like image 562
ds_practicioner Avatar asked Dec 23 '16 22:12

ds_practicioner


2 Answers

There are several ways to achieve the desired result using dcast. jazzurro's solution does the aggregation before reshaping the result. The approaches here use dcast directly but may require some post-processing. We are using jazzurro's data which are tweaked to obey the UTC time zone and CRAN version 1.10.0 of data.table.

1. Getting ifelse to work

As reported in the Q,

dcast(
  dt, person + door ~ type, 
  value.var = 'time', 
  fun.aggregate = function(x) ifelse(type == 'timeIn', min(x), max(x))
)

returns an error message. The full text of the error message includes the hint to use the fill parameter. Unfortunately, ifelse() doesn't respect the POSIXct class (for details see ?ifelse) so this needs to be enforced.

With

dcast(
  dt, person + door ~ type, 
  value.var = 'time', 
  fun.aggregate = function(x) 
    lubridate::as_datetime(ifelse(type == 'timeIn', min(x), max(x))),
  fill = 0
)

we do get

#   person       door              timeIn             timeOut
#1:    ana front door 2016-12-02 07:06:01 2016-12-02 07:06:05
#2:    bob front door 2016-12-02 06:05:01 2016-12-02 06:05:05

2. Alternative to ifelse

ifelse's help page suggests

(tmp <- yes; tmp[!test] <- no[!test]; tmp)

as alternative. Following this advice,

dcast(
  dt, person + door ~ type, 
  value.var = 'time', 
  fun.aggregate = function(x) {
    test <- type == "timeIn"; tmp <- min(x); tmp[!test] = max(x)[!test]; tmp
    }
)

returns

#   person       door              timeIn             timeOut
#1:    ana front door 2016-12-02 07:06:01 2016-12-02 07:06:05
#2:    bob front door 2016-12-02 06:05:01 2016-12-02 06:05:05

Note that neither the fill parameter nor the coercion to POSIXct is needed.

3. Using enhanced dcast

With the latest versions of dcast.data.table we can provide a list of functions to fun.aggregate:

dcast(dt, person + door ~ type, value.var = 'time', fun = list(min, max))

returns

#   person       door     time_min_timeIn    time_min_timeOut     time_max_timeIn    time_max_timeOut
#1:    ana front door 2016-12-02 07:06:01 2016-12-02 07:06:03 2016-12-02 07:06:02 2016-12-02 07:06:05
#2:    bob front door 2016-12-02 06:05:01 2016-12-02 06:05:03 2016-12-02 06:05:02 2016-12-02 06:05:05

We can remove the unwanted columns and rename the others by

dcast(dt, person + door ~ type, value.var = 'time', fun = list(min, max))[
  , .(person, door, timeIn = time_min_timeIn, timeOut = time_max_timeOut)]

which gets us

#   person       door              timeIn             timeOut
#1:    ana front door 2016-12-02 07:06:01 2016-12-02 07:06:05
#2:    bob front door 2016-12-02 06:05:01 2016-12-02 06:05:05

Data

As mentioned above, we are using jazzurro's data

dt <- structure(list(person = c("bob", "bob", "bob", "bob", "ana", 
"ana", "ana", "ana"), door = c("front door", "front door", "front door", 
"front door", "front door", "front door", "front door", "front door"
), type = c("timeIn", "timeIn", "timeOut", "timeOut", "timeIn", 
"timeIn", "timeOut", "timeOut"), time = structure(c(1480658701, 
1480658702, 1480658703, 1480658705, 1480662361, 1480662362, 1480662363, 
1480662365), class = c("POSIXct", "POSIXt"))), .Names = c("person", 
"door", "type", "time"), row.names = c(NA, -8L), class = c("data.table", 
"data.frame"))

but coerce the time zone to UTC.

With

dt[, time := lubridate::with_tz(time, "UTC")]

we have

dt
#   person       door    type                time
#1:    bob front door  timeIn 2016-12-02 06:05:01
#2:    bob front door  timeIn 2016-12-02 06:05:02
#3:    bob front door timeOut 2016-12-02 06:05:03
#4:    bob front door timeOut 2016-12-02 06:05:05
#5:    ana front door  timeIn 2016-12-02 07:06:01
#6:    ana front door  timeIn 2016-12-02 07:06:02
#7:    ana front door timeOut 2016-12-02 07:06:03
#8:    ana front door timeOut 2016-12-02 07:06:05

independent of local time zone.

like image 95
Uwe Avatar answered Nov 15 '22 05:11

Uwe


This would be one way to achieve your goal. I modified your dt and created the following data set. For each person, I looked for the minimum time of timeIn and the maximum time of timeOut. Then, I applied dcast() to the result.

#   person       door    type                time
#1:    bob front door  timeIn 2016-12-02 06:05:01
#2:    bob front door  timeIn 2016-12-02 06:05:02
#3:    bob front door timeOut 2016-12-02 06:05:03
#4:    bob front door timeOut 2016-12-02 06:05:05
#5:    ana front door  timeIn 2016-12-02 07:06:01
#6:    ana front door  timeIn 2016-12-02 07:06:02
#7:    ana front door timeOut 2016-12-02 07:06:03
#8:    ana front door timeOut 2016-12-02 07:06:05

library(data.table)

dcast(
   dt[, .SD[(type == "timeIn" & time == min(time))|(type == "timeOut" & time == max(time))], by = person],
   person + door ~ type)

#   person       door              timeIn             timeOut
#1:    ana front door 2016-12-02 07:06:01 2016-12-02 07:06:05
#2:    bob front door 2016-12-02 06:05:01 2016-12-02 06:05:05

DATA

dt <- structure(list(person = c("bob", "bob", "bob", "bob", "ana", 
"ana", "ana", "ana"), door = c("front door", "front door", "front door", 
"front door", "front door", "front door", "front door", "front door"
), type = c("timeIn", "timeIn", "timeOut", "timeOut", "timeIn", 
"timeIn", "timeOut", "timeOut"), time = structure(c(1480658701, 
1480658702, 1480658703, 1480658705, 1480662361, 1480662362, 1480662363, 
1480662365), class = c("POSIXct", "POSIXt"))), .Names = c("person", 
"door", "type", "time"), row.names = c(NA, -8L), class = c("data.table", 
"data.frame"))
like image 35
jazzurro Avatar answered Nov 15 '22 06:11

jazzurro