Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

R - Split data frame row into two rows

Tags:

r

data.table

I have a 2 tables (data & reference; toy example below). These tables have START and END positions that I'd like to check for overlaps (using something like foverlaps from the data.table package) and then split the values as shows below.

>data  <- data.table(ID=c(1,2,3), Chrom=c(1,1,2), Start=c(1,500,1000), End=c(900,5000,5000), Probes=c(899,4500,4500))
>Ref.table <- data.table(Chrom=c(1,2), Split=c(1000,2000))

>Ref.table
Chrom    Split
1        1000
2        2000

>data
ID    Chrom    Start    End    Probes
1     1        1        900    899
2     1        500      5000   4500
3     2        1000     5000   4000

As you can see, ID 1 has no overlap with the reference table, so it would be left alone. However, IDs 2&3, I'd like to split based on Ref.table.

The resulting table I'd like to get is:

>result
ID    Chrom    Start    End    Probes
1     1        1        900    899
2     1        500      1000   500
2     1        1001     5000   4000
3     2        1000     2000   1000
3     2        2001     5000   3000

As I'm sure you can see, there are two parts to this: 1. Split the range into two columns based on a separate table 2. Split the # probes proportionally between the two parts

I've been searching for an R package that can do this (split ranges by Chromosome arm), but haven't been able to find one that does as shown above. Any links to functions packages would be appreciated, but I'm also willing to code this myself...with a little help.

So far, I've only been able to use foverlaps to determine if there are overlaps: example:

>foverlaps(Ref.table[data[14]$Chrom], data[14], which=TRUE)
     xid   yid
1:    1     1
like image 278
Gaius Augustus Avatar asked Sep 01 '15 15:09

Gaius Augustus


2 Answers

Here's a possible foverlaps solution (as mentioned in the Q).

The first two steps are simple and pretty much idiomatic, add an End column to Ref.table so we will have overlaping intervals, then key both data sets by Chrom and the interval columns (in v 1.9.5+ you can now specify by.x and by.y instead) and simply run foverlaps

library(data.table)
setDT(Ref.table)[, End := Split]
setkey(Ref.table)
setkey(setDT(data), Chrom, Start, End)
res <- foverlaps(data, Ref.table)
res
#    Chrom Split  End ID Start i.End Probes
# 1:     1    NA   NA  1     1   900    899
# 2:     1  1000 1000  2   500  5000   4500
# 3:     2  2000 2000  3  1000  5000   4000

Now that we have the overlaps, we need to increase the data set size according to our matches. We can condition this on is.na(Split) (which means no overlap was found). I'm not sure if this part could be done more efficiently

res2 <- res[, if(is.na(Split)) .SD else rbind(.SD, .SD), by = .(ID, Chrom)]
## Or, if you only have one row per group, maybe
## res2 <- res[, if(is.na(Split)) .SD else .SD[c(1L,1L)], by = .(ID, Chrom)]

Now, the last two steps will update the End and Start columns and then the Probes column according to the new column values

res2[!is.na(Split), `:=`(i.End = c(Split[1L], i.End[-1L]),
                         Start = c(Start[-1L], Split[1L] + 1L)), 
     by = .(ID, Chrom)]
res2[!is.na(Split), Probes := i.End - Start]
res2
#    ID Chrom Split  End Start i.End Probes
# 1:  1     1    NA   NA     1   900    899
# 2:  2     1  1000 1000   500  1000    500
# 3:  2     1  1000 1000  1001  5000   3999
# 4:  3     2  2000 2000  1000  2000   1000
# 5:  3     2  2000 2000  2001  5000   2999

(You can remove unwanted columns if you wish)

like image 176
David Arenburg Avatar answered Sep 25 '22 06:09

David Arenburg


First define a splitting function:

splitter<-function(data, reftable){
  splitsite <- which(reftable$Chrom == data$Chrom)
  if(reftable$Split[splitsite] > data$Start && reftable$Split[splitsite] <= data$End){
    return(data.frame(ID = data$ID,
                      Chrom = data$Chrom,
                      Start = c(data$Start, reftable$Split[splitsite] + 1),
                      End = c(reftable$Split[splitsite],data$End),
                      Probes = c((reftable$Split[splitsite]- data$Start)*data$Probes/(data$End-data$Start),
                                 ((data$End - (reftable$Split[splitsite] + 1))*data$Probes/(data$End-data$Start)))))
  } else {
    return(data)
  }  
}

then we can run it on each line using dplyr:

library(dplyr)
data %>% group_by(ID) %>%
         do(splitter(., ref.table))

giving the below. You can see it has 3999 and 2999 rather than your 4000 and 3000, I'm not sure which you want based on your row 1. You can fix it by takign out the +1 in ((data$End - (reftable$Split[splitsite] + 1))

  ID Chrom Start  End Probes
1  1     1     1  900    899
2  2     1   500 1000    500
3  2     1  1001 5000   3999
4  3     2  1000 2000   1000
5  3     2  2001 5000   2999
like image 38
jeremycg Avatar answered Sep 21 '22 06:09

jeremycg