I am new to R, but after taking an intro course and playing with it a bit, I'm hopeful that it can 1) more elegantly solve my modelling objectives (compared to Excel, which is my backup plan) and 2) be a useful skill to take away from this project.
The task/objective:
I am attempting to use driving diary data to simulate and model potential energy and GHG emissions from electric cars. Specifically:
The type of output I would like: NOTE: this output is NOT related to the sample data provided below. I used the first ten minutes of a certain day with some theoretical trips just as an example
Not essential to this problem, but may be useful to know: I will use the above output to cross-reference other driver-specific data to calculate minute-by-minute consumption of gasoline (or electricity) based on things associated with that trip, such as parking location or trip distance. I would like to do this in R but must first figure out the above problem before I move onto this step.
The solution I have so far is based on:
The problem:
Example simplified data:
a <- c("A","A","A","B","B","B","C","C","C")
b <- c(1, 2, 3, 1, 2, 3, 1, 2, 3)
c <- as.POSIXct(c(0.29167, 0.59375, 0.83333, 0.45833, 0.55347, 0.27083, 0.34375, 0.39236, 0.35417)*24*3600 + as.POSIXct("2013-1-1 00:00") )
d <- as.POSIXct(c(0.334027778, 0.614583333, 0.875, 0.461805556, 0.563888889, 0.295138889, 0.375, 0.503472222, 0.364583333)*24*3600 + as.POSIXct("2013-1-1 00:00"))
e <- c(2, 8, 2, 5, 5, 2, 5, 5, 2)
f <- as.POSIXct(c(0, 0.875, 0, 0.479166666666667, 0.580555555555556, 0.489583333333333, 0.430555555555556, 0.541666666666667, 0.711805555555555)*24*3600 + as.POSIXct("2013-1-1 00:00"))
g <- as.POSIXct(c(0, 0.885, 0, 0.482638888888889, 0.588194444444444, 0.496527777777778, 0.454861111111111, 0.559027777777778, 0.753472222222222)*24*3600 + as.POSIXct("2013-1-1 00:00"))
h <- c(0, 1, 0, 1, 4, 8, 8, 1, 5)
i <- as.POSIXct(c(0, 0, 0, 0.729166666666667, 0.595833333333333, 0.534722222222222, 0.59375, 0.779861111111111, 0.753472222222222)*24*3600 + as.POSIXct("2013-1-1 00:00"))
j <- as.POSIXct(c(0, 0, 0, 0.736111111111111, 0.605555555555556, 0.541666666666667, 0.611111111111111, 0.788194444444445, 0.75625)*24*3600 + as.POSIXct("2013-1-1 00:00"))
k <- c(0, 0, 0, 4, 4, 2, 5, 8,1)
testdata <- data.frame(a,b,c,d,e,f,g,h,i,j,k)
names(testdata) <- c("id", "Day", "trip1_start", "trip1_end", "trip1_purpose", "trip2_start", "trip2_end", "trip2_purpose", "trip3_start", "trip3_end", "trip3_purpose")
In this example data, I have three drivers (id = A, B, C) who each drive on three different days (day = 1, 2, 3). Note that some drivers may have different numbers of trips. The time stamps indicate start and end time of driving activities.
I then create minute intervals for an entire day (January 1, 2013):
start.min <- as.POSIXct("2013-01-01 00:00:00 PST")
end.max <- as.POSIXct("2013-01-01 23:59:59 PST")
tinterval <- seq.POSIXt(start.min, end.max, by = "mins")
Insert "1" during minutes where a given user is driving:
library(xts)
out1 <- xts(,align.time(tinterval,60))
# loop over each row (one driver-day per row)
for(i in 1:NROW(testdata)) {
  # paste the start / end times into an xts-style range
  timeRange <- paste(format(testdata[i,c("trip1_start","trip1_end")]),collapse="/")
  # add the minute "by parameter" for timeBasedSeq
  timeRange <- paste(timeRange,"M",sep="/")
  # create the by-minute sequence and align to minutes to match "out1"
  timeSeq <- align.time(timeBasedSeq(timeRange),60)
  # create xts object with "1" entries for times between start and end
  temp1 <- xts(rep(1,length(timeSeq)),timeSeq)
  # merge temp1 with out1 and fill non-matching timestamps with "0"
  out1 <- merge(out1, temp1, fill=0)
}
# add column names
colnames(out1) <- paste(testdata[,1], testdata[,2], sep = ".")
The idea is to then repeat this for each trip, e.g. out2, out3, etc. wherein I would fill any driving periods with "2", "3", etc., and then sum/merge all of the resulting outx dataframes, and eventually get the desired result.
Unfortunately when I try to repeat this for out2...
out2 <- xts(,align.time(tinterval,60))
for(i in 1:NROW(testdata)) {
timeRange2 <- paste(format(testdata[i,c("trip2_start","trip2_end")]),collapse="/")
timeRange2 <- paste(timeRange2,"M",sep="/")
timeSeq2 <- align.time(timeBasedSeq(timeRange2),60)
temp2 <- xts(rep(2,length(timeSeq2)),timeSeq2)
out2 <- merge(out2, temp2, fill=0)
}
colnames(out2) <- paste(testdata[,1], testdata[,2], sep = ".")
head(out2)
I get the following errors:
- Error in UseMethod("align.time") : no applicable method for 'align.time' applied to an object of class "Date"
- Error in `colnames<-`(`*tmp*`, value = c("A.1", "A.2", "A.3", "B.1", "B.2", : attempt to set 'colnames' on an object with less than two dimensions
What is wrong with my code for out2?
Are there any other better solutions or packages I can learn about?
I realize this is probably a very roundabout way to get to my desired output.
Any help would be much appreciated.
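On the first error, a likely cause (assuming the 0 placeholders in trip2_start/trip2_end for rows without a second trip): format() drops the time part of a POSIXct value that is exactly midnight, so for row 1 the range string becomes "2013-01-01/2013-01-01/M". timeBasedSeq() then appears to return Date values, for which align.time() has no method. The loop stops at that row, out2 keeps zero columns, and the colnames<- call fails in turn. The base-R sketch below only demonstrates the format() behaviour; the explicit format string is one possible workaround, not a tested fix for the whole loop.

```r
# Midnight timestamp, like the 0 placeholders in the trip2 columns.
# tz = "UTC" is an assumption made only so the example is reproducible.
midnight <- as.POSIXct("2013-01-01 00:00:00", tz = "UTC")
morning  <- as.POSIXct("2013-01-01 07:00:00", tz = "UTC")

format(midnight)  # time of day is dropped when every value is midnight
format(morning)   # time of day is kept

# Forcing a full format keeps the time part in both cases
fmt <- "%Y-%m-%d %H:%M:%S"
range_midnight <- paste(format(midnight, fmt), format(midnight, fmt), sep = "/")
range_midnight
```

Rows without a trip would still need to be skipped, for example by checking whether start equals end before building the sequence.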
In this solution I read your original data and reshape it into the same form as the generated data in my previous answer. The data provided is limited to 22 trips per driver, but the reshaping here is not limited by the number of trips. The idea is similar to the one used to generate the sample data. I am using data.table since it is handy for manipulating data per group.
So for each (day, driver) pair I do the following:
Here is my code:
# labels for every minute of the day: "00:00", "00:01", ..., "23:59"
start.min <- as.POSIXct("2013-01-01 00:00:00 PST")
hours.min <- format(seq(start.min,
                        length.out=24*60, by = "mins"),
                    '%H:%M')
library(data.table)
diary <- read.csv("samplediary.csv",
                  stringsAsFactors=FALSE)
DT <- data.table(diary,key=c('id','veh_assigned','day'))
# for each (id, day): one integer per minute, 0 = not driving, k = on trip k
dat <- DT[, as.list({ .SD;nb.trip=sum_trips
    tripv <- vector(mode='integer',length(hours.min))
    if(sum_trips>0){
      # look up the X<k>_trip_start / X<k>_trip_end columns of this group
      starts = mget(paste0('X',seq(nb.trip),'_trip_start'))
      ends = mget(paste0('X',seq(nb.trip),'_trip_end'))
      # minute positions covered by each trip
      ids <- mapply(function(x,y){
            seq(as.integer(x),as.integer(y))},
          starts,ends,SIMPLIFY = FALSE)
      # write the trip number into the minutes it covers
      for (x in seq_along(ids))tripv[ids[[x]]] <- x
    }
    tripv
  }),
  by=c('id','day')]
# rename V1, V2, ... to the minute labels
setnames(x=dat,old=paste0('V',seq(hours.min)),hours.min)
Here is what you get if you subset the first 10 rows and the first 10 columns:
dat[1:10,1:10,with=FALSE]
         id day 00:00 00:01 00:02 00:03 00:04 00:05 00:06 00:07
 1: 3847339   1     0     0     0     0     0     0     0     0
 2: 3847384   1     0     0     0     0     0     0     0     0
 3: 3847436   1     0     0     0     0     0     0     0     0
 4: 3847439   1     0     0     0     0     0     0     0     0
 5: 3847510   1     0     0     0     0     0     0     0     0
 6: 3847536   1     0     0     0     0     0     0     0     0
 7: 3847614   1     0     0     0     0     0     0     0     0
 8: 3847683   1     0     0     0     0     0     0     0     0
 9: 3847841   1     0     0     0     0     0     0     0     0
10: 3847850   1     0     0     0     0     0     0     0     0
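The core of the per-group step can be sketched in isolation. This is a minimal base-R illustration, not the code above: fill_trips is a hypothetical helper, and it assumes the start/end values are minute-of-day positions (1 to 1440), which is how the data.table code indexes tripv.

```r
# Hypothetical helper: build the per-minute vector for one driver-day.
# starts/ends are assumed to be minute-of-day positions, one pair per trip.
fill_trips <- function(starts, ends, n_minutes = 24 * 60) {
  tripv <- integer(n_minutes)            # 0 = not driving
  for (k in seq_along(starts)) {
    tripv[seq(starts[k], ends[k])] <- k  # k = trip number
  }
  tripv
}

# Toy example on a 10-minute "day": trip 1 covers minutes 3-5, trip 2 minutes 8-9
v <- fill_trips(starts = c(3L, 8L), ends = c(5L, 9L), n_minutes = 10L)
print(v)
```

If two trips overlapped, the later one would overwrite the earlier one, which is also what the loop in the answer does.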
One idea is to create a heatmap of your data (at least per day) to get some intuition and to see, for example, which drivers overlap. Here are two ways to do this, using lattice and ggplot2, but first I will reshape the data into long format using reshape2:
library(reshape2)
dat.m <- melt(dat,id.vars=c('id','day'))
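To make the long format concrete, here is a rough base-R sketch on toy data of the shape melt produces. The toy table wide and its values are made up for illustration; in practice the single melt call is all you need.

```r
# Toy wide table: one row per (id, day), one column per minute label
wide <- data.frame(id = c("A", "B"), day = c(1, 1),
                   `00:00` = c(0, 1), `00:01` = c(1, 1),
                   check.names = FALSE, stringsAsFactors = FALSE)

minute_cols <- setdiff(names(wide), c("id", "day"))

# Long format: one row per (id, day, minute), stacked minute by minute
long <- data.frame(
  id       = rep(wide$id,  times = length(minute_cols)),
  day      = rep(wide$day, times = length(minute_cols)),
  variable = factor(rep(minute_cols, each = nrow(wide)), levels = minute_cols),
  value    = unlist(wide[minute_cols], use.names = FALSE)
)
long

# The factor codes give the minute position used on the x axis of the plots
as.numeric(long$variable)
```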
Then I plot the heatmap to see, for example, which drivers overlap with others:
library(lattice)
levelplot(value~as.numeric(variable)*factor(id),data=dat.m)
library(ggplot2)
ggplot(dat.m, aes(x=as.numeric(variable),y=factor(id)))+
  geom_tile(aes(fill = value)) +
  scale_fill_gradient(low="grey",high="blue")