
R - Assign column value based on closest match in second data frame

I have two data frames, logger and df (times are numeric):

logger <- data.frame(
time = c(1280248354:1280248413),
temp = runif(60,min=18,max=24.5)
)

df <- data.frame(
obs = c(1:10),
time = runif(10,min=1280248354,max=1280248413),
temp = NA
)

I would like to search logger$time for the closest match to each row in df$time, and assign the associated logger$temp to df$temp. So far, I have been successful using the following loop:

for (i in 1:length(df$time)){
closestto<-which.min(abs((logger$time) - (df$time[i])))
df$temp[i]<-logger$temp[closestto]
}

However, I now have large data frames (logger has 13,620 rows and df has 266,138 rows) and processing times are long. I've read that loops are not the most efficient way to do things, but I am unfamiliar with the alternatives. Is there a faster way to do this?

dschorn asked Nov 13 '13 15:11


2 Answers

I'd use data.table for this. It makes joining on keys both easy and fast. There is even a really helpful roll = "nearest" argument for exactly the behaviour you are looking for: each time in df is matched to the closest time in logger. (Your example df$time values come from runif, so they will generally fall between logger's whole-second times rather than match them exactly, which is the case a rolling join handles.) In the following example I renamed df$time to df$time1 to make it clear which column belongs to which table:

#  Load package
library( data.table )

#  Rename df$time to df$time1 (as described above)
names( df )[ names( df ) == "time" ] <- "time1"

#  Make data.frames into data.tables with a key column
ldt <- data.table( logger , key = "time" )
dt <- data.table( df , key = "time1" )

#  Join based on the key column of the two tables (time & time1)
#  roll = "nearest" gives the desired behaviour
#  list( obs , time1 , temp ) gives the columns you want to return;
#  temp here is logger's temp, because columns of ldt take precedence over dt's
ldt[ dt , list( obs , time1 , temp ) , roll = "nearest" ]
#          time obs      time1     temp
# 1: 1280248361   8 1280248361 18.07644
# 2: 1280248366   4 1280248366 21.88957
# 3: 1280248370   3 1280248370 19.09015
# 4: 1280248376   5 1280248376 22.39770
# 5: 1280248381   6 1280248381 24.12758
# 6: 1280248383  10 1280248383 22.70919
# 7: 1280248385   1 1280248385 18.78183
# 8: 1280248389   2 1280248389 18.17874
# 9: 1280248393   9 1280248393 18.03098
#10: 1280248403   7 1280248403 22.74372
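
As a rough sketch of how the result could be written back into df without renaming columns or reordering rows, the join can name its column explicitly with on =. This assumes a data.table version that supports on = (1.9.6 or later); the seed, the as.numeric() conversion and the names matched and loop_temp are illustrative choices, not part of the original example. df is created without the temp placeholder so the new column is simply added.

```r
library(data.table)

set.seed(42)  # assumption: fixed seed so the sketch is reproducible

logger <- data.table(
  time = as.numeric(1280248354:1280248413),  # numeric, same type as df$time
  temp = runif(60, min = 18, max = 24.5)
)
df <- data.table(
  obs  = 1:10,
  time = runif(10, min = 1280248354, max = 1280248413)
)

# Rolling join: one result row per row of df, in df's original row order;
# the temp column of the result comes from logger
matched <- logger[df, on = "time", roll = "nearest"]
df[, temp := matched$temp]

# Cross-check against the loop from the question
loop_temp <- vapply(
  df$time,
  function(t) logger$temp[which.min(abs(logger$time - t))],
  numeric(1)
)
stopifnot(isTRUE(all.equal(df$temp, loop_temp)))
```

Because no key is set on df, its rows stay in their original order, so df$obs still lines up with the new temp column.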
Simon O'Hanlon answered Oct 09 '22 11:10


You could use the data.table package. A keyed rolling join is also much more efficient than a loop on large data -

library(data.table)

logger <- data.frame(
  time = c(1280248354:1280248413),
  temp = runif(60,min=18,max=24.5)
)

df <- data.frame(
  obs = c(1:10),
  time = runif(10,min=1280248354,max=1280248413)
)

logger <- data.table(logger)
df <- data.table(df)

setkey(df,time)
setkey(logger,time)

df2 <- logger[df, roll = "nearest"]

Output -

> df2
          time     temp obs
 1: 1280248356 22.81437   7
 2: 1280248360 24.08711  10
 3: 1280248366 22.31738   2
 4: 1280248367 18.61222   5
 5: 1280248388 19.46300   4
 6: 1280248393 18.26535   6
 7: 1280248400 20.61901   9
 8: 1280248402 21.92584   1
 9: 1280248410 19.36526   8
10: 1280248410 19.36526   3
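
For comparison, here is a package-free sketch of the same nearest-match lookup using base R's findInterval(). This is a different technique from the rolling join above, not a description of how data.table works internally. The helper name nearest_index is hypothetical, and it assumes the reference times are sorted in increasing order, as logger$time is in the example.

```r
# Hypothetical helper: index of the element of `ref` closest to each `x`.
# Assumes `ref` is sorted in increasing order.
nearest_index <- function(x, ref) {
  ref <- as.numeric(ref)  # avoid integer overflow when adding large timestamps
  stopifnot(length(ref) >= 1L, !is.unsorted(ref))
  # Midpoints between neighbouring reference values are the decision boundaries
  mids <- (ref[-1L] + ref[-length(ref)]) / 2
  # findInterval counts how many boundaries lie at or below each x
  findInterval(x, mids) + 1L
}

set.seed(1)  # assumption: fixed seed so the sketch is reproducible

logger <- data.frame(
  time = 1280248354:1280248413,
  temp = runif(60, min = 18, max = 24.5)
)
df <- data.frame(
  obs  = 1:10,
  time = runif(10, min = 1280248354, max = 1280248413)
)

idx <- nearest_index(df$time, logger$time)
df$temp <- logger$temp[idx]

# Cross-check against the loop from the question
loop_idx <- vapply(
  df$time,
  function(t) which.min(abs(logger$time - t)),
  integer(1)
)
stopifnot(all(idx == loop_idx))
```

One difference to be aware of: a value lying exactly halfway between two reference times goes to the later one here, whereas which.min() picks the earlier one.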
TheComeOnMan answered Oct 09 '22 10:10