Summary of problem: I am cleaning up a fish telemetry dataset (i.e., spatial coordinates through time) using the data.table package (version 1.9.5) in R (version) on a Windows 7 PC. Some of the data points are wrong (e.g., the telemetry equipment picked up echoes). We can tell these points are wrong because the fish moved farther than is biologically possible, so the points stand out as outliers. The actual dataset contains over 2,000,000 rows of data from 30 individual fish, hence the use of the data.table package.
I am removing points that are too far apart (i.e., distance traveled is greater than a maximum distance). However, I need to recalculate the distance traveled between points after removing a point, because 2-3 data points were sometimes misrecorded in clusters. Currently, I have a for loop that gets the job done, but it is likely far from optimal, and I suspect I am missing some of the powerful tools in the data.table package.
As technical notes, my spatial scale is small enough that a Euclidean distance works, and my maximum distance criterion is biologically reasonable.
Where I have looked for help: I have looked through SO and found several helpful answers, but none exactly match my problem. Specifically, all of the other answers compare only one column of data among rows.
This answer compares two rows using data.table, but only looks at one variable.
This answer looks promising and uses Reduce, but I could not figure out how to use Reduce with two columns.
This answer uses an indexing feature from data.table, but I could not figure out how to use it with a distance function.
Last, this answer demonstrates the roll argument of data.table. However, I could not figure out how to use two variables with this feature either.
Here is my MVCE:
library(data.table)

## Create dummy data.table
dt <- data.table(fish = 1,
                 time = 1:6,
                 easting = c(1, 2, 10, 11, 3, 4),
                 northing = c(1, 2, 10, 11, 3, 4))
dt[ , dist := 0]
maxDist = 5

## First pass of calculating distances
for(index in 2:dim(dt)[1]){
    dt[ index,
        dist := as.numeric(dist(dt[c(index -1, index),
                                   list(easting, northing)]))]
}

## Loop through and remove points until all of the outliers have been
## removed for the data.table. The loop condition and the row selection
## both test dist > maxDist, so a distance exactly equal to maxDist is kept.
while(any(dt[ , dist > maxDist])){
    dt <- copy(dt[ - dt[ , min(which(dist > maxDist))], ])
    ## Loops through and recalculates distance after removing outlier
    for(index in 2:dim(dt)[1]){
        dt[ index,
            dist := as.numeric(dist(dt[c(index -1, index),
                                       list(easting, northing)]))]
    }
}
The straight-line distance between two points (x1, y1) and (x2, y2) can be computed with the help of Pythagoras' theorem: dist = sqrt((x2-x1)^2 + (y2-y1)^2). This is known as the Euclidean distance between the points.
The distance formula is straightforward: take the difference between the x-values and the difference between the y-values, add the squares of these, and take the square root of the sum to find the straight-line distance, as in the distance between two points on Google maps over the ground ...
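As a rough illustration of the formula above, the distance from each point to the previous one can be computed for a whole column at once rather than row by row. This is only a sketch: it assumes a data.table version that provides shift(), reuses the toy data from the question, and treats the first fix of each fish as having a distance of 0 (an assumption, since that fix has no predecessor).

```r
library(data.table)

# Same toy data as in the question
dt <- data.table(fish = 1,
                 time = 1:6,
                 easting  = c(1, 2, 10, 11, 3, 4),
                 northing = c(1, 2, 10, 11, 3, 4))

# Euclidean distance from the previous fix of the same fish.
# shift() lags a column by one row, so the first row of each group is NA.
dt[, dist := sqrt((easting  - shift(easting))^2 +
                  (northing - shift(northing))^2),
   by = fish]

# Assumption: the first fix of each fish gets a distance of 0
dt[is.na(dist), dist := 0]
```

This covers only the first pass. It does not handle the recalculation after a point is removed, which is the harder part of the question.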
I'm a little confused why you keep recomputing the distance (and needlessly copying data) instead of just doing a single pass:
last = 1
idx = rep(0, nrow(dt))
for (curr in 1:nrow(dt)) {
  if (dist(dt[c(curr, last), .(easting, northing)]) <= maxDist) {
    idx[curr] = curr
    last = curr
  }
}
dt[idx]
#   fish time easting northing
#1:    1    1       1        1
#2:    1    2       2        2
#3:    1    5       3        3
#4:    1    6       4        4
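Since the real dataset holds 30 fish, the same single-pass idea could be wrapped in a function and applied per fish with by. The sketch below is one possible way to do that, not the answer's own code: keep_reachable is a hypothetical helper name, the second fish's coordinates are made up for illustration, and the first point of each fish is assumed to be valid (as in the loop above, where last starts at 1).

```r
library(data.table)

# Hypothetical helper: flags each point that lies within max_dist of the
# most recent point that was kept. The first point is always kept because
# its distance to itself is 0.
keep_reachable <- function(x, y, max_dist) {
  keep <- logical(length(x))
  last <- 1L
  for (curr in seq_along(x)) {
    if (sqrt((x[curr] - x[last])^2 + (y[curr] - y[last])^2) <= max_dist) {
      keep[curr] <- TRUE
      last <- curr
    }
  }
  keep
}

# Toy data: fish 1 is the question's example, fish 2 is invented
dt <- data.table(fish = rep(1:2, each = 6),
                 time = rep(1:6, times = 2),
                 easting  = c(1, 2, 10, 11, 3, 4,  1, 9, 2, 3, 12, 4),
                 northing = c(1, 2, 10, 11, 3, 4,  1, 9, 2, 3, 12, 4))
maxDist <- 5

# Run the single pass separately within each fish, then drop flagged rows
dt[, keep := keep_reachable(easting, northing, maxDist), by = fish]
cleaned <- dt[keep == TRUE]
```

Working on plain numeric vectors inside the helper avoids subsetting the data.table on every iteration, which is likely to matter with millions of rows, although that has not been benchmarked here.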