Problem: We need a method for calculating distances between points that scales to big data. We outline what we'd like to do below with a five-observation dataframe; however, this particular method becomes infeasible as the number of rows gets large (> 1 million). In the past, we've used SAS for this kind of analysis, but we'd prefer R if possible. (Note: we sketch the small-data version below, but it is essentially impossible to use with data at our scale.)
We start with a dataframe of stores, each of which has a latitude and longitude (though this is not a spatial file, nor do we want to use a spatial file).
# you can think of x and y in this example as Cartesian coordinates
stores <- data.frame(id = 1:5,
x = c(1, 0, 1, 2, 0),
y = c(1, 2, 0, 2, 0))
stores
id x y
1 1 1 1
2 2 0 2
3 3 1 0
4 4 2 2
5 5 0 0
For each store, we want to know the number of stores within a given distance. In a small dataframe, this is straightforward: create another dataframe of all coordinates, merge it back in, calculate the distances, create an indicator for whether each distance is below the cutoff, and add up the indicators (minus one for the store itself, which is at distance 0). This results in a dataset that looks like this:
id x y s1.dist s2.dist s3.dist s4.dist s5.dist
1: 1 1 1 0.000000 1.414214 1.000000 1.414214 1.414214
2: 2 0 2 1.414214 0.000000 2.236068 2.000000 2.000000
3: 3 1 0 1.000000 2.236068 0.000000 2.236068 1.000000
4: 4 2 2 1.414214 2.000000 2.236068 0.000000 2.828427
5: 5 0 0 1.414214 2.000000 1.000000 2.828427 0.000000
Counting distances under an (arbitrary) cutoff of 1.45 as "close", you end up with indicators that look like this:
# don't include the store itself in the total
id x y s1.close s2.close s3.close s4.close s5.close total.close
1: 1 1 1 1 1 1 1 1 4
2: 2 0 2 1 1 0 0 0 1
3: 3 1 0 1 0 1 0 1 2
4: 4 2 2 1 0 0 1 0 1
5: 5 0 0 1 0 1 0 1 2
The final product should look like this:
id total.close
1: 1 4
2: 2 1
3: 3 2
4: 4 1
5: 5 2
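For reference, here is a minimal sketch of that small-data approach, assuming data.table (which the printed output above suggests) and the 1.45 cutoff; it materializes all n^2 pairs, which is exactly what stops working at our scale:
library(data.table)

stores <- data.table(id = 1:5,
                     x = c(1, 0, 1, 2, 0),
                     y = c(1, 2, 0, 2, 0))

# pair every store with every other store ("merge back in"): n^2 rows
pairs <- stores[, .(id2 = stores$id, x2 = stores$x, y2 = stores$y), by = .(id, x, y)]
pairs[, dist := sqrt((x - x2)^2 + (y - y2)^2)]

# count distances under the cutoff, minus one for the store itself
pairs[, .(total.close = sum(dist < 1.45) - 1L), by = id]
The last line reproduces the id / total.close table above.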
All advice appreciated. Thank you very much.
The dist() function in R can be used to calculate a distance matrix, which contains the distances between the rows of a matrix or data frame.
Euclidean distance is the straight-line (shortest) distance between two points. The formula is: Euclidean distance = √Σ(xᵢ - yᵢ)², where x and y are the coordinates of the two points.
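For instance, on the five-store example above, dist() reproduces the pairwise Euclidean distances shown in the question (converted to a full matrix for readability):
stores <- data.frame(id = 1:5,
                     x = c(1, 0, 1, 2, 0),
                     y = c(1, 2, 0, 2, 0))

# distances between the rows of the coordinate columns only
as.matrix(dist(stores[, c("x", "y")]))
         1        2        3        4        5
1 0.000000 1.414214 1.000000 1.414214 1.414214
2 1.414214 0.000000 2.236068 2.000000 2.000000
3 1.000000 2.236068 0.000000 2.236068 1.000000
4 1.414214 2.000000 2.236068 0.000000 2.828427
5 1.414214 2.000000 1.000000 2.828427 0.000000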
To give an example: the distance matrix in a hierarchical cluster analysis on 10,000 records already contains almost 50 million distances. If big data has to be tackled with R, several strategies can be considered. If the data is too big to be analyzed in full, its size can be reduced by sampling.
As a rule of thumb: data sets with up to one million records can easily be processed with standard R. Data sets with roughly one million to one billion records can also be processed in R, but need some additional effort. Data sets with more than one billion records need to be analyzed with map-reduce algorithms.
Use the database: this takes advantage of what databases are often best at, quickly summarizing and filtering data based on a query. More info, less transfer: by aggregating or compressing before pulling data back to R, the entire data set still gets used, but transfer times are far smaller than those for moving the whole data set.
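A minimal sketch of the "use the database / transfer less" idea, assuming the stores sit in a database table reachable through DBI and dbplyr (the connection and table name here are hypothetical):
library(DBI)
library(dplyr)
library(dbplyr)

con <- dbConnect(RSQLite::SQLite(), "stores.db")  # hypothetical connection
stores_db <- tbl(con, "stores")                   # hypothetical table name

# the filtering and counting run inside the database; collect() pulls back
# only the small summary, not the full table
stores_db %>%
  filter(x < 10, y < 10) %>%
  summarise(n_stores = n()) %>%
  collect()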
Any reason you can't loop instead of making it one big calculation?
stores <- data.frame(id = 1:5,
                     x = c(1, 0, 1, 2, 0),
                     y = c(1, 2, 0, 2, 0))

# Here's a Euclidean distance metric, but you can drop anything you want in here
distfun <- function(x0, y0, x1, y1){
  sqrt((x1 - x0)^2 + (y1 - y0)^2)
}

# Loop over each store
t(sapply(seq_len(nrow(stores)), function(i){
  distances <- distfun(x0 = stores$x[i], x1 = stores$x,
                       y0 = stores$y[i], y1 = stores$y)
  # Calculate number less than arbitrary cutoff, subtract one for self
  num_within <- sum(distances < 1.45) - 1
  c(stores$id[i], num_within)
}))
Produces:
[,1] [,2]
[1,] 1 4
[2,] 2 1
[3,] 3 2
[4,] 4 1
[5,] 5 2
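If you want the exact id / total.close layout from the question, you can capture that matrix and rename its columns, for example:
result <- t(sapply(seq_len(nrow(stores)), function(i){
  distances <- distfun(x0 = stores$x[i], x1 = stores$x,
                       y0 = stores$y[i], y1 = stores$y)
  c(stores$id[i], sum(distances < 1.45) - 1)
}))
setNames(as.data.frame(result), c("id", "total.close"))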
This will work with a data set of any size that you can bring into R, but it'll just get slower as the size increases. Here's a test on 10,000 entries that runs in a couple seconds on my machine:
stores <- data.frame(id = 1:10000,
                     x = runif(10000, max = 10),
                     y = runif(10000, max = 10))
# then re-run the same sapply loop as above
[,1] [,2]
[1,] 1 679
[2,] 2 698
[3,] 3 618
[4,] 4 434
[5,] 5 402
...
[9995,] 9995 529
[9996,] 9996 626
[9997,] 9997 649
[9998,] 9998 514
[9999,] 9999 667
[10000,] 10000 603
It gets slower with more points (because it has to compare every pair of points, this will always be O(n^2)), but without knowing the actual distance metric you'd like to calculate, we can't optimize the slow part any further.
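That said, if the metric really is Euclidean as in the example, a fixed-radius neighbour search avoids touching every pair explicitly. A minimal sketch, assuming the dbscan package (whose frNN() uses a kd-tree and, per its documentation, does not return self-matches, so there is no need to subtract one):
library(dbscan)

# neighbours within the 1.45 cutoff for every store, without the full n x n matrix
nn <- frNN(as.matrix(stores[, c("x", "y")]), eps = 1.45)
data.frame(id = stores$id, total.close = lengths(nn$id))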