
R: Calculating offset differences between elements in data frame with the same identifier

Tags: dataframe, r

Below is a subset of my data:

> head(dt)

   name    start     end
1:    1  3195984 3197398
2:    1  3203519 3205713
3:    2  3204562 3207049
4:    2  3411782 3411982 
5:    2  3660632 3661579
6:    3  3638391 3640590

dt <- data.frame(name = c(1, 1, 2, 2, 2, 3), start = c(3195984, 
3203519, 3204562, 3411782, 3660632, 3638391), end = c(3197398, 
3205713, 3207049, 3411982, 3661579, 3640590))

I want to calculate another value: the difference between the end coordinate of line n and the start coordinate of line n+1 (stored on line n+1), but only if both lines share a name. To elaborate, this is what I want the resulting data frame to look like:

   name    start     end    dist
1:    1  3195984 3197398
2:    1  3203519 3205713   -6121
3:    2  3204562 3207049
4:    2  3411782 3411982 -204733
5:    2  3660632 3661579 -248650
6:    3  3638391 3640590

The reason I want to do this is that I'm looking for dist values that are positive. One approach I've tried is to offset the start and end coordinates, but then I run into the problem of comparing rows with different names.

How does one do this in R?

reedms asked Apr 27 '14 23:04


2 Answers

A data.table solution may be good here:

library(data.table)
dt <- as.data.table(dt)
dt[, dist := c(NA, end[-(length(end))] - start[-1]) , by=name]
dt

#   name   start     end    dist
#1:    1 3195984 3197398      NA
#2:    1 3203519 3205713   -6121
#3:    2 3204562 3207049      NA
#4:    2 3411782 3411982 -204733
#5:    2 3660632 3661579 -248650
#6:    3 3638391 3640590      NA
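As a variation on the same idea, and not part of the original answer, the lagged difference can also be sketched with `shift()`, assuming an installed data.table version that provides it. `shift(end)` returns the previous row's `end` within each group, with `NA` for the first row of the group.

```r
library(data.table)

# Same sample data as in the question
dt <- data.table(
  name  = c(1, 1, 2, 2, 2, 3),
  start = c(3195984, 3203519, 3204562, 3411782, 3660632, 3638391),
  end   = c(3197398, 3205713, 3207049, 3411982, 3661579, 3640590)
)

# Previous end minus current start, computed separately for each name
dt[, dist := shift(end) - start, by = name]
```

This avoids the manual index arithmetic while producing the same `dist` column for the sample data.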

Assuming your data is sorted by name, you can also do it with base R functions:

dt$dist <- unlist(
  by(dt, dt$name, function(x) c(NA, x$end[-(length(x$end))] - x$start[-1]) )
)
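If the rows for a given name might not be contiguous, one possible base R sketch uses `ave()`, which writes each group's result back into the original row positions. The column name `prev_end` and the object `overlaps` are introduced here purely for illustration. Rows within a name are still assumed to be in the intended order.

```r
# Same sample data as in the question
dt <- data.frame(
  name  = c(1, 1, 2, 2, 2, 3),
  start = c(3195984, 3203519, 3204562, 3411782, 3660632, 3638391),
  end   = c(3197398, 3205713, 3207049, 3411982, 3661579, 3640590)
)

# End coordinate of the previous row with the same name (NA for the first row)
dt$prev_end <- ave(dt$end, dt$name, FUN = function(e) c(NA, head(e, -1)))

# Previous end minus current start
dt$dist <- dt$prev_end - dt$start

# Rows with a positive dist, which is what the question is looking for;
# which() drops the NA rows
overlaps <- dt[which(dt$dist > 0), ]
```

For the sample data no `dist` value is positive, so `overlaps` has no rows.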
thelatemail answered Sep 17 '22 00:09


Using dplyr (with credit to @thelatemail for the calculation of dist):

library(dplyr)

# %.% was dplyr's original chaining operator and has since been removed;
# %>% is its replacement
dat.new <- dt %>%
  group_by(name) %>%
  mutate(dist = c(NA, end[-(length(end))] - start[-1]))
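A shorter way to express the same calculation, sketched here as an alternative and not taken from the original answer, is `dplyr::lag()`, which returns the previous value within each group. The object name `overlaps` is hypothetical.

```r
library(dplyr)

# Same sample data as in the question
dt <- data.frame(
  name  = c(1, 1, 2, 2, 2, 3),
  start = c(3195984, 3203519, 3204562, 3411782, 3660632, 3638391),
  end   = c(3197398, 3205713, 3207049, 3411982, 3661579, 3640590)
)

dat.new <- dt %>%
  group_by(name) %>%
  # previous end within the same name, minus the current start
  mutate(dist = dplyr::lag(end) - start) %>%
  ungroup()

# filter() drops rows where the condition is NA, so only positive dist remain
overlaps <- dat.new %>% filter(dist > 0)
```

Calling `dplyr::lag()` with the namespace prefix avoids any confusion with `stats::lag()`, which does something different.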
alexwhan answered Sep 20 '22 00:09