Consolidate rows based on date ranges

Tags: date, dataframe, r, data.table

I'd like to combine rows of a data frame such that the ranges described by a "start" and "end" column include all values from the original data set. There might be overlaps, repeats, and nested ranges. Some ranges might be missing.

Here's an example of the kind of data I'd like to collapse:

data = data.frame(rbind(
    c("Roger", 1,  10),
    c("Roger", 10, 15),
    c("Roger", 16, 17),
    c("Roger", 3,  6),
    c("Roger", 20, 25),
    c("Roger", NA, NA),
    c("Susan", 2,  8)))
names(data) = c("name", "start", "end")
data$start = as.numeric(as.character(data$start))
data$end = as.numeric(as.character(data$end))

The desired result would be:

name   start end
Roger  1     17
Roger  20    25
Susan  2     8

My attempt has been to expand out every item in the range for each row. This works, but then I'm not sure how to shrink it back. Additionally, the full dataset I'm working with has ~30 million rows and very large ranges, so this method is VERY slow.

library(data.table)  # for rbindlist

pb <- txtProgressBar(min = 0, max = length(data$name), style = 3)
mylist = list()
for(i in 1:length(data$name)){
  subdata = data[i,]
  # rows with a missing range are kept as a single row with daily = NA
  if(is.na(subdata$start)){
    mylist[[i]] = subdata
    mylist[[i]]$daily = NA
  }
  # other rows are repeated once per value in start:end
  if(!is.na(subdata$start)){
    sequence = seq(subdata$start, subdata$end)
    mylist[[i]] = subdata[rep(1, each = length(sequence)),]
    mylist[[i]]$daily = sequence
  }
  setTxtProgressBar(pb, i)
}

rbindlist(mylist)

asked Aug 19 '16 19:08

Nancy

1 Answer

I'm guessing IRanges is much more efficient for this, but...

library(data.table)

# remove missing values
DT = na.omit(setDT(data))

# sort
setorder(DT, name, start)

# mark threshold for a new group
DT[, high_so_far := shift(cummax(end), fill=end[1L]), by=name]

# group and summarise
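DT[, .(start[1L], max(end)), by=.( name, g = cumsum(start > high_so_far + 1L) )]

#     name g V1 V2
# 1: Roger 0  1 17
# 2: Roger 1 20 25
# 3: Susan 1  2  8

How it works:

cummax is the cumulative maximum, so the highest value so far, including the current row.
To take the value excluding the current row, use shift (which draws from the prior row).
cumsum(some_condition) is a standard way of making a grouping variable. Here it runs over the whole table rather than per name, which is why Susan's g is 1; because name is also in by=, the groups are still correct.
max(end) is the largest end in the group determined by by=. Taking the last row's end instead (end[.N], where .N is the number of rows in the group) would come out too small whenever the last range is nested inside an earlier one, for example (1, 10) followed by (3, 6).

The columns can be named in the last step like .(s = start[1L], e = max(end)) if desired.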
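
The steps above can be wrapped into one function. This is a minimal sketch rather than part of the answer's code: collapse_ranges is a hypothetical helper name, it works on a copy instead of modifying the input by reference, it computes g per name, and it takes max(end) for each group so that a range nested inside an earlier one cannot shorten the result.

```r
library(data.table)

# collapse_ranges is a hypothetical helper name, not a data.table function
collapse_ranges = function(data) {
  DT = na.omit(as.data.table(data))   # copy, then drop rows with missing bounds
  setorder(DT, name, start)
  # highest end seen before the current row, within each name
  DT[, high_so_far := shift(cummax(end), fill = end[1L]), by = name]
  # a new group starts when a range begins past the running maximum plus one
  DT[, g := cumsum(start > high_so_far + 1), by = name]
  # one row per group: earliest start, largest end
  DT[, .(start = start[1L], end = max(end)), by = .(name, g)][, g := NULL][]
}

data = data.frame(
  name  = c("Roger", "Roger", "Roger", "Roger", "Roger", "Roger", "Susan"),
  start = c(1, 10, 16, 3, 20, NA, 2),
  end   = c(10, 15, 17, 6, 25, NA, 8)
)
res = collapse_ranges(data)
print(res)

# a nested pair: the second range lies inside the first
nested = data.frame(name = "Roger", start = c(1, 3), end = c(10, 6))
res_nested = collapse_ranges(nested)
print(res_nested)
```

For the question's data this should give the three rows from the desired result, and the nested pair should collapse to a single range from 1 to 10.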

With date intervals

If working with dates, I'd suggest the IDate class; just use as.IDate to convert a Date.

We can +1 on dates, but unfortunately cannot cummax, so...

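# IDate objects carry the class vector c("IDate", "Date"), in that order
cummax_idate = function(x) (setattr(cummax(unclass(x)), "class", c("IDate", "Date")))

set.seed(1)
d = sample(as.IDate("2011-11-11") + 1:10)
cummax_idate(d)
# (output from R before 3.6.0; newer versions draw a different permutation for this seed)
#  [1] "2011-11-14" "2011-11-15" "2011-11-16" "2011-11-18" "2011-11-18"
#  [6] "2011-11-19" "2011-11-20" "2011-11-20" "2011-11-21" "2011-11-21"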

I think this function can be used in place of cummax.

The extra () in the function are there because setattr returns its result invisibly, so without them the output would not print.
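
To show where the date version slots in, here is a hedged end-to-end sketch with IDate columns. The dates table and its values are made up for illustration, cummax_idate is restated with the class vector in data.table's own order, c("IDate", "Date"), and max(end) is used for the group end; treat it as one possible arrangement, not as the answer's exact code.

```r
library(data.table)

# cumulative maximum for IDate: work on the underlying integers, then restore the class
cummax_idate = function(x) (setattr(cummax(unclass(x)), "class", c("IDate", "Date")))

# made-up example data
dates = data.table(
  name  = c("Roger", "Roger", "Roger", "Susan"),
  start = as.IDate(c("2016-01-01", "2016-01-05", "2016-01-20", "2016-02-01")),
  end   = as.IDate(c("2016-01-10", "2016-01-11", "2016-01-25", "2016-02-03"))
)

setorder(dates, name, start)
# latest end date seen before the current row, within each name
dates[, high_so_far := shift(cummax_idate(end), fill = end[1L]), by = name]
# a new group starts when a range begins more than one day after that date
dates[, g := cumsum(start > high_so_far + 1L), by = name]
res = dates[, .(start = start[1L], end = max(end)), by = .(name, g)]
print(res)
```

With this data, Roger's first two ranges overlap and should merge into 2016-01-01 to 2016-01-11, while his third range and Susan's range stay separate.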

answered Oct 05 '22 23:10

Frank
