Another novice question regarding big data. I'm working with a large time series dataset (3.5m rows). I want to create a data.table with a column containing the first time each unique identifier appears.
df is a data.table, df$timestamp is a date in class POSIXct, and df$id is the unique numeric identifier. I'm using the following code:
# UPDATED - DATA KEYED
setkey(df, id)
sub_df <- df[, min(timestamp), by = list(id)]  # finding first timestamp for each unique ID
Here's the catch: I'm aggregating over 80k unique IDs and R is choking. Is there anything I can do to optimize my approach?
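For context, here's a small mock-up of what df looks like (the values are invented; only the column names and classes match my real data):

require(data.table)

# Toy stand-in for the real 3.5m-row table: numeric id, POSIXct timestamp
df <- data.table(
    id        = c(101, 101, 205, 205, 205, 309),
    timestamp = as.POSIXct(c("2012-01-03 10:00:00", "2012-01-01 09:30:00",
                             "2012-02-14 12:00:00", "2012-02-10 08:15:00",
                             "2012-03-01 17:45:00", "2012-01-20 11:05:00"))
)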
Here's a little code to test which behaviour takes a LOT of time:
require(data.table)
dt <- data.table(sample(seq(as.Date("2012-01-01"), as.Date("2013-12-31"), by = "days"),
                        1e5, replace = TRUE),
                 val = sample(1e4, 1e5, replace = TRUE))

FUN1 <- function() {
    out <- dt[, min(dt$V1), by = val]  # min of entire V1 for each group, i.e. wrong
}

FUN2 <- function() {
    out <- dt[, min(V1), by = val]     # min of V1 within group, as intended
}
require(rbenchmark)
benchmark(FUN1(), FUN2(), replications = 1, order = "elapsed")
#     test replications elapsed relative user.self sys.self user.child sys.child
# 2 FUN2()            1   0.271    1.000     0.242    0.002          0         0
# 1 FUN1()            1  38.378  141.616    32.584    4.153          0         0
It is very clear that FUN2() is blazing fast. Remember, in both cases, the key wasn't set.
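Applied back to the question's df, that's just a matter of using the bare column name in j (a sketch using the id/timestamp names from the question; the first_seen column name is made up):

# Fast: min(timestamp) is evaluated on each group's subset only
sub_df <- df[, .(first_seen = min(timestamp)), by = id]

# Slow (and wrong): df$timestamp refers to the whole column, so min() scans
# all 3.5m values once per group and every group gets the same global minimum
# sub_df <- df[, .(first_seen = min(df$timestamp)), by = id]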