
Calculating distance between two rows in a data.table

Tags: r, data.table

Summary of problem: I am cleaning up a fish telemetry dataset (i.e., spatial coordinates through time) using the data.table package (version 1.9.5) in R (version) on a Windows 7 PC. Some of the data points are wrong (e.g., the telemetry equipment picked up echoes). We can tell these points are wrong because the fish would have had to move farther than is biologically possible, so the points stand out as outliers. The actual dataset contains over 2,000,000 rows of data from 30 individual fish, hence the use of the data.table package.

I am removing points that are too far apart (i.e., distance traveled is greater than a maximum distance). However, I need to recalculate the distance traveled between points after removing a point, because two or three consecutive data points were sometimes misrecorded as a cluster. Currently, I have a for loop that gets the job done, but it is likely far from optimal, and I suspect I am missing some of the powerful tools in the data.table package.

As a technical note, my spatial scale is small enough that a Euclidean distance works, and my maximum distance criterion is biologically reasonable.

Where I have looked for help: I have looked through SO and found several helpful answers, but none exactly match my problem. Specifically, all of the other answers only compare one column of data among rows.

  1. This answer compares two rows using data.table, but only looks at one variable.

  2. This answer looks promising and uses Reduce, but I could not figure out how to use Reduce with two columns.

  3. This answer uses an indexing feature from data.table, but I could not figure out how to use it with a distance function.

  4. Last, this answer demonstrates the roll argument of data.table. However, I could not figure out how to use two variables with it either.

Here is my MCVE:

library(data.table)
## Create dummy data.table
dt <- data.table(fish = 1,
                 time = 1:6,
                 easting = c(1, 2, 10, 11, 3, 4),
                 northing = c(1, 2, 10, 11, 3, 4))
dt[ , dist := 0]

maxDist = 5

## First pass of calculating distances 
for(index in 2:dim(dt)[1]){
    dt[ index,
       dist := as.numeric(dist(dt[c(index -1, index),
                list(easting, northing)]))]
}

## Loop through and remove points until all of the outliers have been
## removed for the data.table. 
while(any(dt[ , dist > maxDist])){
    dt <- copy(dt[ - dt[ , min(which(dist > maxDist))], ])
    ## Loops through and recalculates distance after removing outlier  
    for(index in 2:dim(dt)[1]){
        dt[ index,
           dist := as.numeric(dist(dt[c(index -1, index),
                    list(easting, northing)]))]
    }
}
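
For reference, the first-pass distance calculation can also be written without a row-by-row loop. The following is only a minimal sketch, assuming a data.table version that provides shift() and assuming rows are already ordered by time within each fish; it is not part of the original question.

```r
library(data.table)

dt <- data.table(fish = 1,
                 time = 1:6,
                 easting = c(1, 2, 10, 11, 3, 4),
                 northing = c(1, 2, 10, 11, 3, 4))

# Euclidean distance from the previous row, computed within each fish.
# shift() lags a column by one row, so the first row of each fish gets NA.
dt[, dist := sqrt((easting - shift(easting))^2 +
                  (northing - shift(northing))^2), by = fish]

# Match the original convention of a zero distance for the first point
dt[is.na(dist), dist := 0]
```

Grouping with by = fish keeps the last point of one fish from being compared with the first point of the next.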
Richard Erickson asked Sep 14 '15 19:09



1 Answer

I'm a little confused why you keep recomputing the distance (and needlessly copying data) instead of just doing a single pass:

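last = 1                  # index of the most recently kept row
idx = rep(0, nrow(dt))    # rows to keep; zeros are dropped when subsetting
for (curr in seq_len(nrow(dt))) {
  # keep the row only if it lies within maxDist of the last kept row
  if (dist(dt[c(curr, last), .(easting, northing)]) <= maxDist) {
    idx[curr] = curr
    last = curr
  }
}

dt[idx]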
#   fish time easting northing
#1:    1    1       1        1
#2:    1    2       2        2
#3:    1    5       3        3
#4:    1    6       4        4
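
With around 2,000,000 rows and 30 fish, the same single-pass idea could be run on plain numeric vectors and applied per fish, which avoids subsetting the data.table on every iteration. This is a rough sketch rather than a tested solution for the real dataset: keep_within is a hypothetical helper name, the two-fish data is made up by repeating the example, and, like the loop above, it assumes the first point of each fish is valid and that rows are ordered by time within each fish.

```r
library(data.table)

# Hypothetical helper (not from the original answer): returns a logical
# vector marking the points to keep. A point is kept when it lies within
# max_dist of the most recently kept point; the first point is always kept.
keep_within <- function(x, y, max_dist) {
  n <- length(x)
  keep <- logical(n)
  if (n == 0L) return(keep)
  keep[1L] <- TRUE
  last <- 1L
  for (curr in seq_len(n)[-1L]) {
    if (sqrt((x[curr] - x[last])^2 + (y[curr] - y[last])^2) <= max_dist) {
      keep[curr] <- TRUE
      last <- curr
    }
  }
  keep
}

# Made-up example: the question's six points repeated for two fish
dt <- data.table(fish = rep(1:2, each = 6),
                 time = rep(1:6, times = 2),
                 easting = rep(c(1, 2, 10, 11, 3, 4), times = 2),
                 northing = rep(c(1, 2, 10, 11, 3, 4), times = 2))
maxDist <- 5

# Flag rows within each fish so distances never span two individuals
dt[, keep := keep_within(easting, northing, maxDist), by = fish]
cleaned <- dt[keep == TRUE]
```

Assigning the flag with := inside by = fish writes each result back to the matching rows, so the table does not need to be sorted by fish first.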
eddi answered Oct 13 '22 23:10