
Merging two data frames, both with coordinates based on the closest location

Tags: dataframe, r

I have one large data frame (~130,000 rows) containing local variables and another large data frame (~7,000 rows) containing the density of a species. Both have x and y coordinates, but these coordinates don't always match. For example:

df1 <- data.frame(X = c(2,4,1,2,5), Y = c(6,7,8,9,8), V1 = c("A", "B", "C", "D", "E"), V2 = c("G", "H", "I", "J", "K"))

And:

df2 <- data.frame(X = c(2,4,6), Y = c(5,9,7), Dens = c(12, 17, 10))

I would like to add a column to df1 containing the density (Dens) from df2 if there is a point reasonably close by. If there is no point close by, I would like it to show up as NA. For example:

X Y   V1   V2    Dens
2 6   A    G      12
4 7   B    H      NA     
1 8   C    I      17
2 9   D    J      NA
5 8   E    K      10
Honey91 asked Dec 12 '15 16:12


1 Answer

First, let's write a function to find the closest point in df2 for a single row of df1. Here I'm using the squared Euclidean distance (i.e. (x1 - x2)^2 + (y1 - y2)^2); skipping the square root doesn't change which point is closest. If you have lat/lon you might want to change it to a better formula:

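# Return the row of df2 closest to `row` (a one-row slice of df1),
# together with the squared distance to it.
mydist <- function(row){
  dists <- (row[["X"]] - df2$X)^2 + (row[["Y"]] - df2$Y)^2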
  return(cbind(df2[which.min(dists),], distance = min(dists)))
}
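
For lat/lon data, one common replacement for the planar formula is the haversine great-circle distance. The following is only a minimal sketch, assuming X holds longitude and Y holds latitude in decimal degrees and treating the Earth as a sphere of radius 6371 km; haversine_km and mydist_geo are illustrative names, not part of the original answer:

```r
# Great-circle (haversine) distance in kilometres between one point and a
# vector of points. All inputs are in decimal degrees.
# Assumption: spherical Earth with a mean radius of 6371 km.
haversine_km <- function(lat1, lon1, lat2, lon2, radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  # pmin() guards against rounding pushing sqrt(a) slightly above 1
  2 * radius_km * asin(pmin(1, sqrt(a)))
}

# Hypothetical variant of mydist(): X is taken to be longitude, Y latitude.
# `ref` plays the role of df2 and is passed in explicitly.
mydist_geo <- function(row, ref) {
  dists <- haversine_km(row[["Y"]], row[["X"]], ref$Y, ref$X)
  cbind(ref[which.min(dists), ], distance = min(dists))
}
```

With this variant the distance column is in kilometres rather than squared coordinate units, so any cutoff applied later would need to be in kilometres too.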

Once you have this, apply it to each row of df1 with lapply, rbind the results, and cbind them onto df1:

z <- cbind(df1, do.call(rbind, lapply(seq_len(nrow(df1)), function(x) mydist(df1[x,]))))

For your test data, the output looks like:

   X Y V1 V2 X Y Dens distance
1  2 6  A  G 2 5   12        1
2  4 7  B  H 4 9   17        4
3  1 8  C  I 2 5   12       10
21 2 9  D  J 4 9   17        4
22 5 8  E  K 4 9   17        2

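You can then do something like this to set Dens to NA wherever the nearest point is beyond your threshold. Note that the distance column holds the squared distance, so the threshold is in squared units too: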

threshold <- 5
z$Dens[z$distance > threshold] <- NA

   X Y V1 V2 X Y Dens distance
1  2 6  A  G 2 5   12        1
2  4 7  B  H 4 9   17        4
3  1 8  C  I 2 5   NA       10
21 2 9  D  J 4 9   17        4
22 5 8  E  K 4 9   17        2
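
The same steps can be combined so that df1 only gains a Dens column, which is closer to the layout asked for in the question. This is a rough sketch rather than the method used above: nearest_dens is a hypothetical helper, and its max_dist argument is an ordinary (unsquared) distance that is squared internally before comparing.

```r
df1 <- data.frame(X = c(2, 4, 1, 2, 5), Y = c(6, 7, 8, 9, 8),
                  V1 = c("A", "B", "C", "D", "E"),
                  V2 = c("G", "H", "I", "J", "K"))
df2 <- data.frame(X = c(2, 4, 6), Y = c(5, 9, 7), Dens = c(12, 17, 10))

# Hypothetical helper: for each row of `pts`, find the nearest row of `ref`
# and return its Dens, or NA when that point is farther than `max_dist`.
nearest_dens <- function(pts, ref, max_dist) {
  vapply(seq_len(nrow(pts)), function(i) {
    sq <- (pts$X[i] - ref$X)^2 + (pts$Y[i] - ref$Y)^2
    j <- which.min(sq)  # ties resolve to the first matching row of ref
    if (sq[j] > max_dist^2) NA_real_ else ref$Dens[j]
  }, numeric(1))
}

# sqrt(5) is the same cutoff as threshold <- 5 on the squared distance
df1$Dens <- nearest_dens(df1, df2, max_dist = sqrt(5))
```

For the test data this gives the same Dens values as the table above (12, 17, NA, 17, 17), without the duplicated X and Y columns.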

Your actual data is very large (a simulated data set of the same size takes about 10 minutes on my computer). If possible, merge on the exact coordinates first, then run this only on the rows that have no exact match (see dplyr::anti_join).
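
A minimal sketch of that merge-first idea, written with base R's merge instead of dplyr::anti_join so that it has no package dependencies. The small data set is made up for illustration (its first point matches df2 exactly), and the sketch assumes df2 has no missing Dens values and no duplicated coordinates:

```r
# Illustrative data, not the asker's: the first point matches df2 exactly.
df1 <- data.frame(X = c(2, 4, 1), Y = c(5, 7, 8), V1 = c("A", "B", "C"))
df2 <- data.frame(X = c(2, 4, 6), Y = c(5, 9, 7), Dens = c(12, 17, 10))

# Exact matches come from a cheap left join on the coordinates.
df1$id <- seq_len(nrow(df1))            # remember the original row order
z <- merge(df1, df2, by = c("X", "Y"), all.x = TRUE)
z <- z[order(z$id), ]

# Exact matches are at distance 0; the rest are filled in below.
z$distance <- ifelse(is.na(z$Dens), NA_real_, 0)

# Only rows without an exact match need the nearest-point search.
for (i in which(is.na(z$Dens))) {
  sq <- (z$X[i] - df2$X)^2 + (z$Y[i] - df2$Y)^2
  j <- which.min(sq)
  z$Dens[i] <- df2$Dens[j]
  z$distance[i] <- sq[j]                # squared distance, as in mydist()
}
```

The threshold step can then be applied to z$distance exactly as before. How much time this saves depends on how many rows match exactly.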

jeremycg answered Nov 22 '22 09:11