I came across the following problem today and I am wondering if there is a better way to accomplish what I am trying to do.
Let's suppose I have the following data.table (just an hourly timestamp):
library(data.table)
tdt <- data.table(Timestamp = seq(as.POSIXct("1980-01-01 00:00:00"), as.POSIXct("2015-01-01 00:00:00"), '1 hour'))
> tdt
Timestamp
1: 1980-01-01 00:00:00
2: 1980-01-01 01:00:00
3: 1980-01-01 02:00:00
4: 1980-01-01 03:00:00
5: 1980-01-01 04:00:00
---
306813: 2014-12-31 20:00:00
306814: 2014-12-31 21:00:00
306815: 2014-12-31 22:00:00
306816: 2014-12-31 23:00:00
306817: 2015-01-01 00:00:00
My goal is to change the minutes of the timestamp to, say, 10 minutes.
I know I can use:
library(lubridate)
minute(tdt$Timestamp) <- 10
but this does not take advantage of data.table's speed (which I need). On my laptop this took:
> system.time(minute(tdt$Timestamp) <- 10)
user system elapsed
11.29 0.16 11.45
So, my question is: can we somehow use a replacement function inside the data.table syntax so that it does what I want at data.table's speed? If the answer is no, any other fast data.table solution would be acceptable.
In case you are wondering, one of the things I tried is:
tdt[, Timestamp2 := minute(Timestamp) <- 10]
which does not work.
> tdt
Timestamp
1: 1980-01-01 00:10:00
2: 1980-01-01 01:10:00
3: 1980-01-01 02:10:00
4: 1980-01-01 03:10:00
5: 1980-01-01 04:10:00
---
306813: 2014-12-31 20:10:00
306814: 2014-12-31 21:10:00
306815: 2014-12-31 22:10:00
306816: 2014-12-31 23:10:00
306817: 2015-01-01 00:10:00
A POSIXct object is just a double with some attributes:
storage.mode(as.POSIXct("1980-01-01 00:00:00"))
## [1] "double"
So in order to manipulate it efficiently you can just treat it as one, for instance:
tdt[, Timestamp := Timestamp + 600L]
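This adds 600 seconds (10 minutes) to each row by reference. Because every timestamp here falls exactly on the hour, that is the same as setting the minutes to 10.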
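If the timestamps did not all fall on the hour, adding a constant would shift the minutes rather than set them. Below is a minimal sketch of a more general variant using the same numeric arithmetic. It assumes a time zone whose offset from UTC is a whole number of hours (UTC is used here), so that hour boundaries are multiples of 3600 seconds since the epoch. The table `dt` and its values are made up for illustration, and note that this version also resets the seconds to zero.

```r
library(data.table)

# Hypothetical sample with non-zero minutes and seconds, in UTC.
dt <- data.table(Timestamp = as.POSIXct(c("1980-01-01 00:25:00",
                                          "1980-01-01 01:40:30"),
                                        tz = "UTC"))

# Remove the seconds elapsed since the start of the hour,
# then add 10 minutes (600 seconds), all by reference.
dt[, Timestamp := Timestamp - (as.numeric(Timestamp) %% 3600) + 600]

print(dt)
```

For time zones with half-hour or other fractional offsets this arithmetic would not line up with local hour boundaries, so it should be treated as an illustration rather than a drop-in replacement.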
Some benchmarks
tdt <- data.table(Timestamp = seq(as.POSIXct("1600-01-01 00:00:00"),
as.POSIXct("2015-01-01 00:00:00"),
'1 hour'))
system.time(minute(tdt$Timestamp) <- 10)
# user system elapsed
# 124.86 1.95 127.68
system.time(set(tdt, j = 1L, value = `minute<-`(tdt$Timestamp, 10)))
# user system elapsed
# 124.99 1.83 128.25
system.time(tdt[, Timestamp := Timestamp + dminutes(10)])
# user system elapsed
# 0.39 0.04 0.42
system.time(tdt[, Timestamp := Timestamp + 600L])
# user system elapsed
# 0.01 0.00 0.01
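Replacement functions are run in two steps:
1. The replacement function itself is called, e.g. `minute<-`(x, value), which returns a modified copy of x.
2. That result is assigned back to the original variable.
You can run step 1 without running step 2. That result can then be used to set the data.table column (set is used here, but you could use := as well).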
library(lubridate)
library(data.table)
tdt <- data.table(Timestamp = seq(as.POSIXct("1980-01-01 00:00:00"), as.POSIXct("2015-01-01 00:00:00"), '1 hour'))
minute(tdt$Timestamp) <- 20
print( `minute<-`(tdt$Timestamp,11) )
set( tdt, j=1L,value=`minute<-`(tdt$Timestamp,11) )
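As a rough sketch of the := form mentioned above, the same call to `minute<-` can be placed inside the data.table brackets. The small table `small` is a made-up example, and this assumes lubridate is attached so that `minute<-` resolves to its replacement function.

```r
library(data.table)
library(lubridate)

# Hypothetical two-row table, kept small so the example runs quickly.
small <- data.table(Timestamp = as.POSIXct(c("1980-01-01 00:00:00",
                                             "1980-01-01 01:00:00"),
                                           tz = "UTC"))

# Calling the replacement function directly returns a modified copy
# and leaves the column in `small` untouched.
new_ts <- `minute<-`(small$Timestamp, 11)

# Assigning that result by reference with := instead of set().
small[, Timestamp := `minute<-`(Timestamp, 11)]

print(small)
```

This only changes how the result is written back. The new vector is still computed by `minute<-`, so it should not be expected to be faster than the set() version.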
Edit: Small data.table vs. big data.table benchmarking
library(lubridate)
library(data.table)
library(microbenchmark)
# Config
tms <- 5L
# Sample data, 1 column
tdt <- data.table(Timestamp = seq(as.POSIXct("1980-01-01 00:00:00"), as.POSIXct("2015-01-01 00:00:00"), '1 hour'))
minute(tdt$Timestamp) <- 20
tdf <- as.data.frame( tdt )
# Sample data, lots of columns
bdf <- cbind( tdf, as.data.frame( replicate( 100, runif(nrow(tdt)) ) ) )
bdt <- as.data.table( bdf )
# Benchmark
microbenchmark(
`minute<-`(tdt$Timestamp,10), # How long does the operation to generate the new vector itself take?
set( tdt, j=1L,value=`minute<-`(tdt$Timestamp,11) ), # One column: How long does it take to generate the new vector and replace the contents in the data.table?
minute( tdf$Timestamp ) <- 12, # One column: How long does it take to do it with a data.frame?
set( tdt, j=1L,value=`minute<-`(bdt$Timestamp,13) ), # Many columns: How long does it take to generate the new vector and replace the contents in the data.table? (As benchmarked, the vector is read from bdt but written into tdt.)
minute( bdf$Timestamp ) <- 14, # Many columns: How long does it take to do it with a data.frame?
times = tms
)
Unit: seconds
                                                   expr      min       lq     mean   median       uq      max neval
                          `minute<-`(tdt$Timestamp, 10) 1.304388 1.385883 1.417616 1.389316 1.459166 1.549327     5
set(tdt, j = 1L, value = `minute<-`(tdt$Timestamp, 11)) 1.314495 1.344277 1.376241 1.352124 1.389083 1.481225     5
                            minute(tdf$Timestamp) <- 12 1.342104 1.349231 1.488639 1.378840 1.380659 1.992358     5
set(tdt, j = 1L, value = `minute<-`(bdt$Timestamp, 13)) 1.337944 1.383429 1.402802 1.418211 1.418922 1.455503     5
                            minute(bdf$Timestamp) <- 14 1.332482 1.333713 1.355331 1.335728 1.342607 1.432127     5
Looks like it is no faster, which contradicts my understanding of what is going on. Strange.