
How to identify the longest range of consecutive years in a list together with both the start and end date?

Tags: r

Say I have a list of year integers as follows:

olap = c(1992, 1993, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2011, 2012, 2013, 2014);

What is the least complicated and most R-like way of identifying the longest range of consecutive years together with both the start date and the end date? I expect to obtain: length: 10, start year: 1997, end year: 2006.

I have searched the web, including this site, and people seem to recommend using rle() in this case. So my approach to solving the problem is as follows:

olap_diff_rle = rle(diff(olap));
max_diff_run = max(olap_diff_rle$lengths[olap_diff_rle$values==1]);
idx = cumsum(olap_diff_rle$lengths)[olap_diff_rle$values==1 & olap_diff_rle$lengths==max_diff_run] + 1;
max_olap_end_year = olap[idx];
max_olap_start_year = max_olap_end_year - max_diff_run;
max_olap = max_diff_run + 1;

But this appears horribly inelegant. There must be a less complicated way of doing this! I only want to use base R though, so no packages. I have read that one might also use something like which(diff() != 1) to identify the breaks and continue from there?

harry asked Jul 22 '15 17:07


2 Answers

I like the approach with diff and rle, but would do it like this:

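with(rle(diff(olap)), {
    # longest run of year-to-year differences equal to 1
    dur <- max(lengths[values==1])
    # index in olap where that run ends ([1] picks the first run if there is a tie)
    end <- sum(lengths[1:which(values==1 & lengths==dur)[1]])+1
    list(duration=dur+1, start=olap[end-dur], end=olap[end])
})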

# $duration
# [1] 10
# 
# $start
# [1] 1997
# 
# $end
# [1] 2006
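
The same idea can be wrapped in a small reusable function. This is only a sketch and not part of the original answer: the name longest_run is hypothetical, and sorting and de-duplicating the input first is an added assumption about what a caller would want.

```r
olap <- c(1992, 1993, 1997, 1998, 1999, 2000, 2001, 2002, 2003,
          2004, 2005, 2006, 2011, 2012, 2013, 2014)

# Hypothetical helper: longest run of consecutive integers in x.
longest_run <- function(x) {
  stopifnot(length(x) > 0)
  x <- sort(unique(x))            # assumption: order and duplicates do not matter
  r <- rle(diff(x))
  ok <- r$values == 1             # runs of "next value is previous + 1"
  if (!any(ok)) {
    # no two consecutive values at all: report a run of length 1
    return(list(duration = 1L, start = x[1], end = x[1]))
  }
  # which.max returns the first maximum, so ties resolve to the earliest run
  i   <- which.max(ifelse(ok, r$lengths, 0L))
  dur <- r$lengths[i]
  end <- sum(r$lengths[seq_len(i)]) + 1L
  list(duration = dur + 1L, start = x[end - dur], end = x[end])
}

res <- longest_run(olap)
# duration 10, start 1997, end 2006
```

Using which.max on the masked lengths avoids the separate max() and which() steps and makes the tie-breaking rule explicit.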
Rorschach answered Sep 23 '22 07:09


With dplyr, here's another way:

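library(dplyr) # masks stats::lag with dplyr's vector-shifting lag

# indices of elements that follow a gap of more than one year
jumps  = which(olap-lag(olap)>1)
starts = c(1,jumps)              # first index of each run
ends   = c(jumps-1,length(olap)) # last index of each run
maxrun = which.max(ends-starts)  # longest run (the first one on ties)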

olap[c(starts[maxrun],ends[maxrun])]
# [1] 1997 2006

For the duration of the run, you can use (ends-starts+1)[maxrun]. The data.table function shift is another option instead of dplyr's lag.


Without packages, here's a simple lag function you can write in lieu of loading one:

lag <- function(x) c(NA,head(x,-1))
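
The which(diff() != 1) idea mentioned in the question gives the same breaks without any lag helper at all. A minimal base R sketch, assuming olap is sorted and has no duplicates (the name result is just illustrative):

```r
olap <- c(1992, 1993, 1997, 1998, 1999, 2000, 2001, 2002, 2003,
          2004, 2005, 2006, 2011, 2012, 2013, 2014)

# diff() is one element shorter than olap, so add 1 to get the index
# of the first element after each break
jumps  <- which(diff(olap) != 1) + 1
starts <- c(1, jumps)                  # first index of each run
ends   <- c(jumps - 1, length(olap))   # last index of each run
maxrun <- which.max(ends - starts)     # first longest run on ties

result <- c(length = ends[maxrun] - starts[maxrun] + 1,
            start  = olap[starts[maxrun]],
            end    = olap[ends[maxrun]])
# length 10, start 1997, end 2006
```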
Frank answered Sep 20 '22 07:09