I have a database with >300,000 records of animal observations. Each row represents an animal location. Each animal has a unique id (id1
) and several columns with attributes related to that animal location, including date of observation and x and y coordinates.
Could someone help me create a code that will allow me to the following:
1) Subset data by BOTH date
and id1
2) Measure the distance (coordinates are in UTM so distance will be in meters) between the FIRST and LAST location record for each date
for each different id1
An example of the data is the following:
mydata<-read.table(text =
"id1 date x y
1 11/02/2014 478776.4332 7922167.59
1 11/02/2014 478776.4333 7922170.59
1 11/02/2014 478776.4334 7922180.59
1 12/02/2014 478776.4335 7922190.59
1 12/02/2014 478776.4350 7922192.59
1 12/02/2014 478776.4360 7922195.59
2 11/02/2014 478776.4338 7922167.59
2 11/02/2014 478776.4339 7922183.59
2 11/02/2014 478776.4340 7922185.59
2 12/02/2014 478776.4350 7922188.30
2 12/02/2014 478776.4360 7922190.59
2 12/02/2014 478776.4390 7922198.59
3 11/02/2014 478776.4338 7922167.59
3 11/02/2014 478776.4345 7922175.59
3 11/02/2014 478776.4355 7922178.85
3 12/02/2014 478776.4368 7922180.59
3 12/02/2014 478776.4398 7922183.59
3 12/02/2014 478776.4399 7922185.59
4 11/02/2014 478776.4338 7922167.59
4 11/02/2014 478776.4340 7922172.59
4 11/02/2014 478776.4345 7922178.59
3 11/02/2014 478776.4350 7922179.59
3 12/02/2014 478776.4355 7922184.59
3 12/02/2014 478776.4360 7922187.59
3 12/02/2014 478776.4365 7922198.59",
header = TRUE)
A much less efficient alternative would be to select the first and last records for each date
and id1
and then measure distance between all pairs of points. I found a code to select the LAST record for each individual animal but I will still have to incorporate the subset by date as well:
myid.uni <- unique(mydata$id1)
a<-length(myid.uni)
last <- c()
for (i in 1:a) {
temp<-subset(mydata, id1==myid.uni[i])
if (dim(temp)[1] > 1) {
last.temp<-temp[dim(temp)[1],]
}
else {
last.temp<-temp
}
last<-rbind(last, last.temp)
}
Could someone help me with either strategy, preferentially with the most straightforward way to do this?
Thank you!
using dplyr:
library(dplyr)
mydata %>%
group_by(id1, date) %>%
summarize(x1 = first(x), y1 = first(y), x2 = last(x), y2 = last(y)) %>%
mutate(distance = sqrt((x2-x1)^2+(y2-y1)^2))
Results:
Source: local data frame [7 x 7]
Groups: id1 [4]
id1 date x1 y1 x2 y2 distance
(int) (fctr) (dbl) (dbl) (dbl) (dbl) (dbl)
1 1 11/02/2014 478776.4 7922168 478776.4 7922181 13.000000
2 1 12/02/2014 478776.4 7922191 478776.4 7922196 5.000001
3 2 11/02/2014 478776.4 7922168 478776.4 7922186 18.000000
4 2 12/02/2014 478776.4 7922188 478776.4 7922199 10.290001
5 3 11/02/2014 478776.4 7922168 478776.4 7922180 12.000000
6 3 12/02/2014 478776.4 7922181 478776.4 7922199 18.000000
7 4 11/02/2014 478776.4 7922168 478776.4 7922179 11.000000
You can construct the data using data.table
# load data.table and make mytable a data.table
library(data.table)
setDT(mydata)
# contstruct new dataset
newData <- mydata[, .(startx=head(x, 1), endx=tail(x, 1),
starty=head(y, 1), endy=tail(y, 1)),
by=.(id1, date)]
# calculate distance
newData[, distance := sqrt((endx-startx)^2+(endy-starty)^2)]
I use the same (Euclidean) distance formula as @edward-r-mazurek.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With