Say I have a vector of year integers as follows:
olap = c(1992, 1993, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2011, 2012, 2013, 2014);
What is the least complicated and most R-like way of identifying the longest range of consecutive years, together with its start year and end year? I expect to obtain: length: 10, start year: 1997, end year: 2006.
I have searched the web, including this site, and people seem to recommend using rle() in this case. So my approach to solving the problem is as follows:
olap_diff_rle = rle(diff(olap));
max_diff_run = max(olap_diff_rle$lengths[olap_diff_rle$values==1]);
idx = cumsum(olap_diff_rle$lengths)[olap_diff_rle$lengths==max_diff_run] + 1;
max_olap_end_year = olap[idx];
max_olap_start_year = max_olap_end_year - max_diff_run;
max_olap = max_diff_run + 1;
But this appears horribly inelegant. There must be a less complicated way of doing this! I only want to use base R though, so no packages. I have read that one might also use something like which(diff() != 1) to identify the breaks and continue from there?
I like the approach with diff and rle, but would do it like this:
with(rle(diff(olap)), {
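  # length of the longest run of one-year steps
  dur <- max(lengths[values == 1])
  # index in olap of that run's last year; [1] picks the first run if several tie
  end <- sum(lengths[1:which(values == 1 & lengths == dur)[1]]) + 1
  list(duration = dur + 1, start = olap[end - dur], end = olap[end])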
})
# $duration
# [1] 10
#
# $start
# [1] 1997
#
# $end
# [1] 2006
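If the index arithmetic feels fiddly, a related base R idiom is to label each run of consecutive years and split the vector by that label. This is only a sketch, not the method from the answer above; the names runs and longest are illustrative, and on ties which.max picks the first of the longest runs.

```r
olap <- c(1992, 1993, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004,
          2005, 2006, 2011, 2012, 2013, 2014)

# A new run starts at the first element and wherever the step is not 1;
# cumsum() turns those start flags into a run label for every year.
runs <- split(olap, cumsum(c(TRUE, diff(olap) != 1)))

# Pick the longest run (the first one if several have the same length).
longest <- runs[[which.max(lengths(runs))]]

res_split <- list(duration = length(longest),
                  start    = min(longest),
                  end      = max(longest))
res_split
```

The cost is that split() copies the data into a list, which is fine for a short vector of years like this one.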
With dplyr, here's another way:
library(dplyr) # dplyr::lag masks stats::lag and shifts a plain vector by one position
jumps = which(olap-lag(olap)>1)
starts = c(1,jumps)
ends = c(jumps-1,length(olap))
maxrun = which.max(ends-starts)
olap[c(starts[maxrun],ends[maxrun])]
# [1] 1997 2006
For the duration of the run, you can use (ends-starts+1)[maxrun]. The data.table function shift is another option instead of dplyr's lag.
No packages: here's a simple lag function you can write in lieu of loading a package:
lag <- function(x) c(NA,head(x,-1))
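Since diff(olap) yields the same differences as olap - lag(olap) minus the leading NA, the same idea can probably be written with no lag helper at all, along the lines of the which(diff() != 1) suggestion in the question. A minimal sketch; the names breaks, run_len and res_diff are illustrative and not from the answers:

```r
olap <- c(1992, 1993, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004,
          2005, 2006, 2011, 2012, 2013, 2014)

# Positions whose successor is not exactly one year later, i.e. run ends.
breaks <- which(diff(olap) != 1)

# Each run starts right after a break (or at position 1) and ends at a
# break (or at the last position).
starts <- c(1, breaks + 1)
ends   <- c(breaks, length(olap))

run_len <- ends - starts + 1
longest <- which.max(run_len)  # first longest run if there are ties

res_diff <- list(duration = run_len[longest],
                 start    = olap[starts[longest]],
                 end      = olap[ends[longest]])
res_diff
```

Note that breaks here marks the last position of each run, whereas jumps in the dplyr version marks the first position of the next run, hence the breaks + 1 in starts.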