Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Consolidate rows based on date ranges

I'd like to combine rows of a data frame such that the ranges described by a "start" and "end" column include all values from the original data set. There might be overlaps, repeats, and nested ranges. Some ranges might be missing.

Here's an example of the kind of data I'd like to collapse:

data = data.frame(rbind(
    c("Roger", 1,  10),
    c("Roger", 10, 15),
    c("Roger", 16, 17),
    c("Roger", 3,  6),
    c("Roger", 20, 25),
    c("Roger", NA, NA),
    c("Susan", 2,  8)))
names(data) = c("name", "start", "end")
data$start = as.numeric(as.character(data$start))
data$end = as.numeric(as.character(data$end))

The desired result would be:

name   start end
Roger  1     17
Roger  20    25
Susan  2     8

My attempt has been to expand out every item in the range for each row. This works, but then I'm not sure how to shrink it back. Additionally, the full dataset I'm working with has ~30 million rows and very large ranges, so this method is VERY slow.

pb <- txtProgressBar(min = 0, max = length(data$name), style = 3)
mylist = list()
for(i in 1:length(data$name)){
  subdata = data[i,]
  if(is.na(subdata$start)){
    mylist[[i]] = subdata
    mylist[[i]]$each = NA
  }
  if(!is.na(subdata$start)){
    sequence = seq(subdata$start, subdata$end)  
    mylist[[i]] = subdata[rep(1, each = length(sequence)),]
    mylist[[i]]$daily = sequence
  }
  setTxtProgressBar(pb, i)
}

rbindlist(mylist)
like image 663
Nancy Avatar asked Aug 19 '16 19:08

Nancy


People also ask

How do I combine rows in Excel by criteria?

First, select the rows you want to merge then open the Home tab and expand Merge & Centre. From these options select Merge Cells. After selecting Merge Cells it will pop up a message which values it is going to keep. Then click on OK.


Video Answer


1 Answers

I'm guessing IRanges is much more efficient for this, but...

library(data.table)

# remove missing values
DT = na.omit(setDT(data))

# sort
setorder(DT, name, start)

# mark threshold for a new group
DT[, high_so_far := shift(cummax(end), fill=end[1L]), by=name]

# group and summarise
DT[, .(start[1L], end[.N]), by=.( name, g = cumsum(start > high_so_far + 1L) )]

#     name g V1 V2
# 1: Roger 0  1 17
# 2: Roger 1 20 25
# 3: Susan 1  2  8

How it works:

  • cummax is the cumulative maximum, so the highest value so far, including the current row.
  • To take the value excluding the current row, use shift (which draws from the prior row).
  • cumsum(some_condition) is a standard way of making a grouping variable.
  • .N is the last row of the group determined by by=.

The columns can be named in the last step like .(s = start[1L], e = end[.N]) if desired.


With date intervals. If working with dates, I'd suggest the IDate class; just use as.IDate to convert a Date.

We can +1 on dates, but unfortunately cannot cummax, so...

cummax_idate = function(x) (setattr(cummax(unclass(x)), "class", c("Date", "IDate")))

set.seed(1)
d = sample(as.IDate("2011-11-11") + 1:10)
cummax_idate(d)
#  [1] "2011-11-14" "2011-11-15" "2011-11-16" "2011-11-18" "2011-11-18"
#  [6] "2011-11-19" "2011-11-20" "2011-11-20" "2011-11-21" "2011-11-21"

I think this function can be used in place of cummax.

The extra () in the function are there because setattr won't print its output.

like image 132
Frank Avatar answered Oct 05 '22 23:10

Frank