 

Big data ways to calculate sets of distances in R?

Problem: We need a big data method for calculating distances between points. We outline what we'd like to do below with a five-observation dataframe. However, this particular method is infeasible as the number of rows gets large (> 1 million). In the past, we've used SAS to do this kind of analysis, but we'd prefer R if possible. (Note: I'm not going to show code because, while I outline a way to do this on smaller datasets below, this is basically an impossible method to use with data on our scale.)

We start with a dataframe of stores, each of which has a latitude and longitude (though this is not a spatial file, nor do we want to use a spatial file).

# you can think of x and y in this example as Cartesian coordinates
stores <- data.frame(id = 1:5,
                     x = c(1, 0, 1, 2, 0),
                     y = c(1, 2, 0, 2, 0))

stores
  id x y
1  1 1 1
2  2 0 2
3  3 1 0
4  4 2 2
5  5 0 0

For each store, we want to know the number of stores within x distance. In a small dataframe, this is straightforward. Create another dataframe of all coordinates, merge back in, calculate distances, create an indicator if the distance is less than x and add up the indicators (minus one for the store itself, which is at distance 0). This would result in a dataset that looks like this:

   id x y  s1.dist  s2.dist  s3.dist  s4.dist  s5.dist
1:  1 1 1 0.000000 1.414214 1.000000 1.414214 1.414214
2:  2 0 2 1.414214 0.000000 2.236068 2.000000 2.000000
3:  3 1 0 1.000000 2.236068 0.000000 2.236068 1.000000
4:  4 2 2 1.414214 2.000000 2.236068 0.000000 2.828427
5:  5 0 0 1.414214 2.000000 1.000000 2.828427 0.000000

If you (arbitrarily) count any distance under 1.45 as "close," you end up with indicators that look like this:

# don't include the store itself in the total
   id x y s1.close s2.close s3.close s4.close s5.close total.close
1:  1 1 1        1        1        1        1        1           4
2:  2 0 2        1        1        0        0        0           1
3:  3 1 0        1        0        1        0        1           2
4:  4 2 2        1        0        0        1        0           1
5:  5 0 0        1        0        1        0        1           2

The final product should look like this:

   id total.close
1:  1           4
2:  2           1
3:  3           2
4:  4           1
5:  5           2
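Since the question deliberately omits code, here is a minimal base-R sketch (not part of the original question) of the small-data workflow described above. It reproduces the total.close column via dist(), but note that it materializes the full n x n distance matrix, which is exactly what becomes infeasible past a million rows:

```r
# Small-data sketch only: dist() builds the full n x n matrix
stores <- data.frame(id = 1:5,
                     x = c(1, 0, 1, 2, 0),
                     y = c(1, 2, 0, 2, 0))

d <- as.matrix(dist(stores[, c("x", "y")]))   # 5 x 5 Euclidean distance matrix

# Count neighbors under the arbitrary 1.45 cutoff; subtract one for the store itself
stores$total.close <- rowSums(d < 1.45) - 1

stores[, c("id", "total.close")]
#   id total.close
# 1  1           4
# 2  2           1
# 3  3           2
# 4  4           1
# 5  5           2
```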

All advice appreciated.

Thank you very much

asked Dec 17 '21 by dmcd

People also ask

What does the Dist function do in R?

The dist() function in R can be used to calculate a distance matrix, which displays the distances between the rows of a matrix or data frame.
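A minimal illustration (not from the original page), using a two-point matrix so the single pairwise distance is easy to verify by hand:

```r
# Two points: (0, 0) and (3, 4)
m <- matrix(c(0, 0,
              3, 4), nrow = 2, byrow = TRUE)

d     <- dist(m)                        # Euclidean by default: sqrt(3^2 + 4^2) = 5
d_man <- dist(m, method = "manhattan")  # |0-3| + |0-4| = 7
```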

How do you calculate Euclidean distance in R?

Euclidean distance is the straight-line (shortest possible) distance between two points. The formula is: Euclidean distance = √Σ(xᵢ - yᵢ)², where x and y are the two points. The same distance between two numeric vectors can also be calculated directly in R.
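The formula can be written out by hand in one line (a small sketch, not from the original page; the function name euclid is illustrative):

```r
# Straight translation of sqrt(sum((x_i - y_i)^2))
euclid <- function(x, y) sqrt(sum((x - y)^2))

euclid(c(0, 0), c(3, 4))  # 5
```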

What is an example of big data analysis in R?

To give an example: the distance matrix in a hierarchical cluster analysis on 10,000 records contains almost 50 million distances. If big data has to be tackled with R, five different strategies can be considered. If the data are too big to be analyzed in full, their size can be reduced by sampling.

How big of a data set can r handle?

As a rule of thumb: data sets that contain up to one million records can easily be processed with standard R. Data sets with about one million to one billion records can also be processed in R, but need some additional effort. Data sets that contain more than one billion records need to be analyzed with map-reduce algorithms.

Why would you use a database in R?

Use the database: this takes advantage of what databases are often best at, quickly summarizing and filtering data based on a query. By aggregating or compressing before pulling data back to R, the entire data set still gets used, but transfer times are far less than moving the entire data set.
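As a hedged sketch of that pattern (the DBI and RSQLite packages are assumptions, not mentioned in the original page): the aggregation runs inside the database, and only the one-row summary is transferred back to R.

```r
# Sketch: summarize in the database, pull back only the result
library(DBI)

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(con, "stores", data.frame(id = 1:5,
                                            x = c(1, 0, 1, 2, 0),
                                            y = c(1, 2, 0, 2, 0)))

# Only this one-row summary crosses the connection, not the full table
res <- DBI::dbGetQuery(con, "SELECT COUNT(*) AS n, AVG(x) AS mean_x FROM stores")

DBI::dbDisconnect(con)
res
```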


1 Answer

Any reason you can't loop instead of making it one big calculation?

stores <- data.frame(id = 1:5,
                     x = c(1, 0, 1, 2, 0),
                     y = c(1, 2, 0, 2, 0))

# Here's a Euclidean distance metric, but you can drop anything you want in here
distfun <- function(x0, y0, x1, y1){
  sqrt((x1-x0)^2+(y1-y0)^2)
}

# Loop over each store
t(sapply(seq_len(nrow(stores)), function(i){
  distances <- distfun(x0 = stores$x[i], x1 = stores$x,
                       y0 = stores$y[i], y1 = stores$y)
  # Calculate number less than arbitrary cutoff, subtract one for self
  num_within <- sum(distances<1.45)-1
  c(stores$id[i], num_within)
}))

Produces:

     [,1] [,2]
[1,]    1    4
[2,]    2    1
[3,]    3    2
[4,]    4    1
[5,]    5    2

This will work with a data set of any size that you can bring into R, but it'll just get slower as the size increases. Here's a test on 10,000 entries that runs in a couple of seconds on my machine:

stores <- data.frame(id = 1:10000, 
                     x = runif(10000, max = 10), 
                     y = runif(10000, max = 10))

# re-running the t(sapply(...)) loop above then prints:
          [,1] [,2]
    [1,]     1  679
    [2,]     2  698
    [3,]     3  618
    [4,]     4  434
    [5,]     5  402
...
 [9995,]  9995  529
 [9996,]  9996  626
 [9997,]  9997  649
 [9998,]  9998  514
 [9999,]  9999  667
[10000,] 10000  603

It gets slower with more points (because it has to compare every pair of points, this will always be O(n^2)), but without knowing the actual distance metric you'd like to calculate, we can't optimize the slow part any further.
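One wall-clock improvement that doesn't change the O(n^2) complexity: the loop iterations are independent, so they can be spread across cores. A hedged sketch (not part of the original answer) using the base parallel package; mclapply relies on forking, so on Windows it falls back to one core here:

```r
# Same per-store calculation as the answer above, run across cores
library(parallel)

stores <- data.frame(id = 1:5,
                     x = c(1, 0, 1, 2, 0),
                     y = c(1, 2, 0, 2, 0))

distfun <- function(x0, y0, x1, y1) sqrt((x1 - x0)^2 + (y1 - y0)^2)

# Forking is unavailable on Windows, so use a single core there
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

counts <- mclapply(seq_len(nrow(stores)), function(i) {
  distances <- distfun(stores$x[i], stores$y[i], stores$x, stores$y)
  sum(distances < 1.45) - 1   # same arbitrary cutoff, minus self
}, mc.cores = n_cores)

cbind(id = stores$id, total.close = unlist(counts))
```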

answered Oct 28 '22 by Dubukay