I'd like to combine rows of a data frame such that the ranges described by a "start" and "end" column include all values from the original data set. There might be overlaps, repeats, and nested ranges. Some ranges might be missing.
Here's an example of the kind of data I'd like to collapse:
data = data.frame(rbind(
c("Roger", 1, 10),
c("Roger", 10, 15),
c("Roger", 16, 17),
c("Roger", 3, 6),
c("Roger", 20, 25),
c("Roger", NA, NA),
c("Susan", 2, 8)))
names(data) = c("name", "start", "end")
data$start = as.numeric(as.character(data$start))
data$end = as.numeric(as.character(data$end))
The desired result would be:
name start end
Roger 1 17
Roger 20 25
Susan 2 8
My attempt has been to expand out every item in the range for each row. This works, but then I'm not sure how to shrink it back. Additionally, the full dataset I'm working with has ~30 million rows and very large ranges, so this method is VERY slow.
pb <- txtProgressBar(min = 0, max = length(data$name), style = 3)
mylist = list()
for(i in 1:length(data$name)){
subdata = data[i,]
if(is.na(subdata$start)){
mylist[[i]] = subdata
mylist[[i]]$each = NA
}
if(!is.na(subdata$start)){
sequence = seq(subdata$start, subdata$end)
mylist[[i]] = subdata[rep(1, each = length(sequence)),]
mylist[[i]]$daily = sequence
}
setTxtProgressBar(pb, i)
}
rbindlist(mylist)
First, select the rows you want to merge then open the Home tab and expand Merge & Centre. From these options select Merge Cells. After selecting Merge Cells it will pop up a message which values it is going to keep. Then click on OK.
I'm guessing IRanges is much more efficient for this, but...
library(data.table)
# remove missing values
DT = na.omit(setDT(data))
# sort
setorder(DT, name, start)
# mark threshold for a new group
DT[, high_so_far := shift(cummax(end), fill=end[1L]), by=name]
# group and summarise
DT[, .(start[1L], end[.N]), by=.( name, g = cumsum(start > high_so_far + 1L) )]
# name g V1 V2
# 1: Roger 0 1 17
# 2: Roger 1 20 25
# 3: Susan 1 2 8
How it works:
cummax
is the cumulative maximum, so the highest value so far, including the current row.shift
(which draws from the prior row).cumsum(some_condition)
is a standard way of making a grouping variable..N
is the last row of the group determined by by=
.The columns can be named in the last step like .(s = start[1L], e = end[.N])
if desired.
With date intervals. If working with dates, I'd suggest the IDate
class; just use as.IDate
to convert a Date
.
We can +1
on dates, but unfortunately cannot cummax
, so...
cummax_idate = function(x) (setattr(cummax(unclass(x)), "class", c("Date", "IDate")))
set.seed(1)
d = sample(as.IDate("2011-11-11") + 1:10)
cummax_idate(d)
# [1] "2011-11-14" "2011-11-15" "2011-11-16" "2011-11-18" "2011-11-18"
# [6] "2011-11-19" "2011-11-20" "2011-11-20" "2011-11-21" "2011-11-21"
I think this function can be used in place of cummax
.
The extra ()
in the function are there because setattr
won't print its output.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With