
Interpolate values from a grid efficiently in R

I have a grid of ocean depth data by location, and am trying to interpolate depth values for a selection of GPS points.

We've been using RSAGA::pick.from.points, which works fine for small data sets.

require(RSAGA)

depthdata <- cbind.data.frame(x=c(-74.136, -74.135, -74.134, -74.133, -74.132, -74.131, -74.130, -74.129, -74.128, -74.127),
                              y=rep(40, times=10),
                              depth=c(-0.6, -0.6, -0.9, -0.9, -0.9, -0.9, -0.9, -0.9, -0.6, -0.6))

mylocs <- rbind(c(-74.1325, 40), c(-74.1305, 40))
colnames(mylocs) <- c("x", "y")

results <- pick.from.points(data=mylocs, src=depthdata, pick=c("depth"), method="nearest.neighbour")
mydepths <- results$depth

But our depth data set contains 69 million data points, and we have 5 million GPS points that we'd like depth estimates for, and pick.from.points is just taking too long (> 2 weeks) for this data set. We think that we could accomplish this task more quickly in MATLAB or ArcMap, but we're trying to incorporate this task into a longer workflow in R that we're writing for other people to run repeatedly, so switching to proprietary software for part of that workflow is less than desirable.

We'd be willing to sacrifice some degree of accuracy for speed.

I've searched for a solution as best I can, but I'm fairly new to grid data and interpolation, so I might be using the wrong terminology and therefore missing a simple solution.


asked May 15 '15 by fredtal


1 Answer

If you were willing to impute by finding the nearest neighbor and using its value, I think the trick would be to use an efficient nearest neighbors implementation that allows you to find the nearest neighbor among n alternatives in O(log(n)) time. The k-d tree provides this sort of performance, and is available through the FNN package in R. While the computation (on randomly generated data with 69 million data points for reference and 5 million data points to impute) isn't instantaneous (it takes about 3 minutes), it's much quicker than 2 weeks!

# Simulated reference set: 69 million (x, y) points with an associated value
data <- cbind(x=rnorm(6.9e7), y=rnorm(6.9e7))
labels <- rnorm(6.9e7)

# 5 million query points to impute
query <- cbind(x=rnorm(5e6), y=rnorm(5e6))

library(FNN)
get.nn <- function(data, labels, query) {
  # k-d tree search for the single nearest reference point to each query point
  nns <- get.knnx(data, query, k=1)
  labels[nns$nn.index]
}
system.time(get.nn(data, labels, query))
#    user  system elapsed
# 174.975   2.236 177.617
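
Applied to the objects from the question, the same idea looks roughly like this (a sketch; the only requirement is that the coordinate columns are passed as a numeric matrix):

library(FNN)
# Sketch: nearest-neighbour depth lookup using the question's example objects
nn <- get.knnx(as.matrix(depthdata[, c("x", "y")]),  # reference grid coordinates
               as.matrix(mylocs),                    # query GPS points
               k=1)
mydepths <- depthdata$depth[nn$nn.index]             # depth of the nearest grid point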

As a warning, the process peaked at around 10 GB of RAM, so you will need significant memory resources to run this on a dataset of your size.
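
Since the question mentions being willing to trade some accuracy for speed, a further option (just a sketch, not benchmarked) is to thin the reference set before the search; a coarser grid shrinks both the build time of the search structure and the memory footprint, at the cost of matching each GPS point to a slightly more distant grid cell. Using the simulated objects above:

keep <- seq(1, nrow(data), by=4)  # hypothetical thinning factor: keep every 4th reference point
system.time(get.nn(data[keep, ], labels[keep], query))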

answered Sep 18 '22 by josliber