Translating time stamps (start, end) into time series data. Errors with align.time() and colnames

I am new to R, but after taking an intro course and playing with it a bit, I'm hopeful that it can 1) more elegantly solve my modelling objectives (compared to Excel, which is my backup plan) and 2) be a useful skill to take away from this project.

The task/objective:

I am attempting to use driving diary data to simulate and model potential energy and GHG emissions from electric cars. Specifically:

  1. I have driving diary data (start and end time stamps, plus other data of thousands of drivers -- basic sample below) that I want to translate into:
  2. 24-hour time series data, such that for each minute of a 24-hour period, I know exactly who is driving a vehicle and which 'trip' that minute belongs to (for that driver). My problem here focuses on this issue.

The type of output I would like is shown in the image below. NOTE: this output is NOT related to the sample data provided further down. I used the first ten minutes of a certain day with some theoretical trips just as an example.

[Image: example of the desired output, not reproduced here]

Not essential to this problem, but may be useful to know: I will use the above output to cross-reference other driver-specific data to calculate minute-by-minute consumption of gasoline (or electricity) based on things associated with that trip, such as parking location or trip distance. I would like to do this in R but must first figure out the above problem before I move onto this step.

The solution I have so far is based on:

  • How to count the number of concurrent users using time interval data?
  • How to calculate number of occurrences per minute for a large dataset

The problem:

Example simplified data:

a <- c("A","A","A","B","B","B","C","C","C")
b <- c(1, 2, 3, 1, 2, 3, 1, 2, 3)
c <- as.POSIXct(c(0.29167, 0.59375, 0.83333, 0.45833, 0.55347, 0.27083, 0.34375, 0.39236, 0.35417)*24*3600 + as.POSIXct("2013-1-1 00:00") )
d <- as.POSIXct(c(0.334027778, 0.614583333, 0.875, 0.461805556, 0.563888889, 0.295138889, 0.375, 0.503472222, 0.364583333)*24*3600 + as.POSIXct("2013-1-1 00:00"))
e <- c(2, 8, 2, 5, 5, 2, 5, 5, 2)
f <- as.POSIXct(c(0, 0.875, 0, 0.479166666666667, 0.580555555555556, 0.489583333333333, 0.430555555555556, 0.541666666666667, 0.711805555555555)*24*3600 + as.POSIXct("2013-1-1 00:00"))
g <- as.POSIXct(c(0, 0.885, 0, 0.482638888888889, 0.588194444444444, 0.496527777777778, 0.454861111111111, 0.559027777777778, 0.753472222222222)*24*3600 + as.POSIXct("2013-1-1 00:00"))
h <- c(0, 1, 0, 1, 4, 8, 8, 1, 5)
i <- as.POSIXct(c(0, 0, 0, 0.729166666666667, 0.595833333333333, 0.534722222222222, 0.59375, 0.779861111111111, 0.753472222222222)*24*3600 + as.POSIXct("2013-1-1 00:00"))
j <- as.POSIXct(c(0, 0, 0, 0.736111111111111, 0.605555555555556, 0.541666666666667, 0.611111111111111, 0.788194444444445, 0.75625)*24*3600 + as.POSIXct("2013-1-1 00:00"))
k <- c(0, 0, 0, 4, 4, 2, 5, 8,1)
testdata <- data.frame(a,b,c,d,e,f,g,h,i,j,k)
names(testdata) <- c("id", "Day", "trip1_start", "trip1_end", "trip1_purpose", "trip2_start", "trip2_end", "trip2_purpose", "trip3_start", "trip3_end", "trip3_purpose")

In this example data, I have three drivers (id = A, B, C) who each drive on three different days (day = 1, 2, 3). Note that some drivers may have different numbers of trips. The time stamps indicate start and end time of driving activities.

I then create minute intervals for an entire day (January 1, 2013):

start.min <- as.POSIXct("2013-01-01 00:00:00 PST")
end.max <- as.POSIXct("2013-01-01 23:59:59 PST")
tinterval <- seq(start.min, end.max, by = "mins")

Insert "1" during minutes where a given user is driving:

library(xts)
# empty xts object indexed by every minute of the day
out1 <- xts(,align.time(tinterval,60))
# loop over each user
for(i in 1:NROW(testdata)) {
  # paste the start / end times into an xts-style range
  timeRange <- paste(format(testdata[i,c("trip1_start","trip1_end")]),collapse="/")
  # add the minute "by parameter" for timeBasedSeq
  timeRange <- paste(timeRange,"M",sep="/")
  # create the by-minute sequence and align to minutes to match "out"
  timeSeq <- align.time(timeBasedSeq(timeRange),60)
  # create xts object with "1" entries for times between start and end
  temp1 <- xts(rep(1,length(timeSeq)),timeSeq)
  # merge temp1 with out and fill non-matching timestamps with "0"
  out1 <- merge(out1, temp1, fill=0)
}
# add column names
colnames(out1) <- paste(testdata[,1], testdata[,2], sep = ".")

The idea is to then repeat this for each trip, e.g. out2, out3, etc. wherein I would fill any driving periods with "2", "3", etc., and then sum/merge all of the resulting outx dataframes, and eventually get the desired result.

Unfortunately when I try to repeat this for out2...

out2 <- xts(,align.time(tinterval,60))
for(i in 1:NROW(testdata)) {
  timeRange2 <- paste(format(testdata[i,c("trip2_start","trip2_end")]),collapse="/")
  timeRange2 <- paste(timeRange2,"M",sep="/")
  timeSeq2 <- align.time(timeBasedSeq(timeRange2),60)
  temp2 <- xts(rep(2,length(timeSeq2)),timeSeq2)
  out2 <- merge(out2, temp2, fill=0)
}
colnames(out2) <- paste(testdata[,1], testdata[,2], sep = ".")
head(out2)

I get the following errors:

  • Error in UseMethod("align.time") : no applicable method for 'align.time' applied to an object of class "Date"
  • Error in colnames<-(*tmp*, value = c("A.1", "A.2", "A.3", "B.1", "B.2", : attempt to set 'colnames' on an object with less than two dimensions

What is wrong with my code for out2?

Are there any other better solutions or packages I can learn about?

I realize this is probably a very roundabout way to get to my desired output.

Any help would be much appreciated.

George K asked Jun 28 '13 11:06


1 Answer

In this solution I read your original data and reshape it into the same form as the generated data of my previous answer. The data provided is limited to 22 trips per driver, but the reshaping here does not depend on the number of trips. The idea is similar to the one used to generate the sample data. I am using data.table since it is handy for manipulating data per group.

So for each (day, driver) pair I do the following:

  1. create a vector of zeros with one element per minute of the day
  2. read the start and end positions from the X<n>_trip_start and X<n>_trip_end columns
  3. create the sequence seq(start,end)
  4. use this sequence as an index to replace the zeros with the trip number
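The four steps above can be sketched in base R for a single driver-day. This is only an illustration under assumptions: the helper name fill_trips is hypothetical, the trips are made up, and the start and end values are assumed to be minute-of-day indices (1 to 1440), which may not match how samplediary.csv actually encodes them.

```r
# Hypothetical helper: build the per-minute trip vector for one driver-day.
# Assumes starts/ends are minute-of-day indices in 1..1440 (an assumption,
# not something confirmed by the original data).
fill_trips <- function(starts, ends, n_minutes = 24 * 60) {
  tripv <- integer(n_minutes)          # step 1: vector of zeros
  for (k in seq_along(starts)) {       # step 2: read each start/end pair
    idx <- seq(starts[k], ends[k])     # step 3: minutes covered by trip k
    tripv[idx] <- k                    # step 4: write the trip number
  }
  tripv
}

# Two made-up trips: minutes 421-481 and minutes 856-885
v <- fill_trips(starts = c(421, 856), ends = c(481, 885))
table(v)

# Turning a timestamp into a minute index with arithmetic instead of strings
day0 <- as.POSIXct("2013-01-01 00:00:00", tz = "UTC")
t1   <- as.POSIXct("2013-01-01 07:00:00", tz = "UTC")
minute_index <- as.integer(as.numeric(difftime(t1, day0, units = "mins"))) + 1L

# format() drops the time part when the value is exactly midnight
midnight_string <- format(day0)
```

Working with integer indices also avoids formatting timestamps as strings. format() on a POSIXct value returns only the date when every value is exactly midnight, as with the 0 placeholders in the sample trip2 columns, which is a likely source of the class "Date" error in the question.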

Here is my code:

start.min <- as.POSIXct("2013-01-01 00:00:00 PST")
hours.min <- format(seq(start.min, 
                        length.out=24*60, by = "mins"),
                    '%H:%M')
library(data.table)
diary <- read.csv("samplediary.csv",
                  stringsAsFactors=FALSE)
DT <- data.table(diary,key=c('id','veh_assigned','day'))

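dat <- DT[, as.list({
           .SD                     # reference .SD so every column is visible to mget()
           nb.trip <- sum_trips    # number of trips for this (id, day)
           # one element per minute of the day, initialised to 0
           tripv <- vector(mode='integer', length(hours.min))
           if (nb.trip > 0) {
             # look up the columns X1_trip_start, X1_trip_end, ...
             starts <- mget(paste0('X', seq(nb.trip), '_trip_start'))
             ends   <- mget(paste0('X', seq(nb.trip), '_trip_end'))
             # minute indices covered by each trip
             ids <- mapply(function(x, y) seq(as.integer(x), as.integer(y)),
                           starts, ends, SIMPLIFY = FALSE)
             # write the trip number into the minutes it covers
             for (x in seq_along(ids)) tripv[ids[[x]]] <- x
           }
           tripv
         }),
         by=c('id','day')]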
setnames(x=dat,old=paste0('V',seq(hours.min)),hours.min)

Here is what you get if you subset the first 10 rows and the first 10 columns:

dat[1:10,1:10,with=FALSE]


       id day 00:00 00:01 00:02 00:03 00:04 00:05 00:06 00:07
 1: 3847339   1     0     0     0     0     0     0     0     0
 2: 3847384   1     0     0     0     0     0     0     0     0
 3: 3847436   1     0     0     0     0     0     0     0     0
 4: 3847439   1     0     0     0     0     0     0     0     0
 5: 3847510   1     0     0     0     0     0     0     0     0
 6: 3847536   1     0     0     0     0     0     0     0     0
 7: 3847614   1     0     0     0     0     0     0     0     0
 8: 3847683   1     0     0     0     0     0     0     0     0
 9: 3847841   1     0     0     0     0     0     0     0     0
10: 3847850   1     0     0     0     0     0     0     0     0

One idea is to create a heatmap of your data (at least per day) to get some intuition and, for example, see which drivers overlap. Here are 2 ways to do this, using lattice and ggplot2, but first I reshape the data into long format using reshape2:

library(reshape2)
dat.m <- melt(dat,id.vars=c('id','day'))
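To make the long format concrete, here is a small base R sketch that builds the same shape by hand for a toy table. It does not call melt; the data values are made up, and the column names variable and value simply mirror the defaults that reshape2's melt produces.

```r
# Toy wide table: one row per (id, day), one column per minute (made-up values)
wide <- data.frame(id = c(1, 2), day = c(1, 1),
                   `00:00` = c(0, 1), `00:01` = c(0, 1),
                   check.names = FALSE)

minute_cols <- setdiff(names(wide), c("id", "day"))

# Long format by hand: one row per (id, day, minute)
long <- data.frame(
  id       = rep(wide$id,  times = length(minute_cols)),
  day      = rep(wide$day, times = length(minute_cols)),
  variable = factor(rep(minute_cols, each = nrow(wide)), levels = minute_cols),
  value    = unlist(wide[minute_cols], use.names = FALSE)
)
long
```

Because variable is a factor whose levels follow the column order, as.numeric(variable) gives the position of each minute within the day, which is what the plots use for the x axis.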

Then I plot my heatmap to see which drivers are overlapping with others for example:

library(lattice)
levelplot(value~as.numeric(variable)*factor(id),data=dat.m)

[Image: heatmap produced by lattice levelplot]

library(ggplot2)
ggplot(dat.m, aes(x=as.numeric(variable), y=factor(id))) +
  geom_tile(aes(fill = value)) +
  scale_fill_gradient(low="grey", high="blue")

[Image: heatmap produced by ggplot2]

agstudy answered Sep 29 '22 12:09