I have a set of data that looks like this:
Date boolean
407 2006-06-01 1
408 2006-06-02 1
409 2006-06-03 1
410 2006-06-04 NA
411 2006-06-05 0
412 2006-06-06 1
413 2006-06-07 1
414 2006-06-08 0
415 2006-06-09 1
From this, I am trying to create a new data frame that will show me what dates my runs of 1's occur as well as how long these runs are, with the column headers: 1) start date, 2) end date, and 3) length of run.
Ultimately, I want to create a data frame that looks like this from the data I have above:
Start Date End Date Length of Run
1 2006-06-01 2006-06-03 3
2 2006-06-06 2006-06-07 2
I have a few NA's in my data that I need to ignore throughout my data as well.
You could do this with dplyr
, using mutate
to convert missing boolean
values to 0, group_by
to compute groups with constant values of variable boolean
, filter
to limit to groups where boolean
was set to 1 and where the group had more than one member, and then summarize
to grab the relevant summary information. (I take a few extra steps to remove the grouping variable at the end).
library(dplyr)
dat %>%
mutate(boolean = ifelse(is.na(boolean), 0, boolean)) %>%
group_by(group = cumsum(c(0, diff(boolean) != 0))) %>%
filter(boolean == 1 & n() > 1) %>%
summarize("Start Date"=min(as.character(Date)),
"End Date"=max(as.character(Date)),
"Length of Run"=n()) %>%
ungroup() %>%
select(-matches("group"))
# Start Date End Date Length of Run
# (chr) (chr) (int)
# 1 2006-06-01 2006-06-03 3
# 2 2006-06-06 2006-06-07 2
Data:
dat <- read.table(text=" Date boolean
407 2006-06-01 1
408 2006-06-02 1
409 2006-06-03 1
410 2006-06-04 NA
411 2006-06-05 0
412 2006-06-06 1
413 2006-06-07 1
414 2006-06-08 0
415 2006-06-09 1", header=T)
We can also use data.table
to subset and cast the data as needed. First we create an id
column with rleid(boolean)
. Next, subset the data according to the necessary conditions. Lastly, we create start
, end
, and run
with the subsetted data:
library(data.table)
setDT(dat)[,id := rleid(boolean)][
,.SD[.N > 1 & boolean == 1],id][
,.(start=Date[1],end=Date[.N], run=.N),id]
# id start end run
#1: 1 2006-06-01 2006-06-03 3
#2: 4 2006-06-06 2006-06-07 2
Another answer using base, reformatting this answer's use of cumsum
and diff
.
#Remove ineligible dates (defined by 0 or NA)
x1 <- x[!(x$boolean %in% c(NA, 0)), ]
x1$Date <- as.Date(x1$Date) #Convert date from factor to Date class
#Starting at 0, if the difference between eligible dates is >1 day,
# add 1 (TRUE) to the previous value, else add 0 (FALSE) to previous value
#This consecutively numbers each series
x1$SeriesNo <- cumsum(c(0, diff(x1$Date) > 1))
# Date boolean SeriesNo
#407 2006-06-01 1 0
#408 2006-06-02 1 0
#409 2006-06-03 1 0
#412 2006-06-06 1 1
#413 2006-06-07 1 1
#415 2006-06-09 1 2
# Aggregate: Perform the function FUN on variable Date by each SeriesNo group
x2 <- as.data.frame(as.list(
aggregate(Date ~ SeriesNo, data= x1, FUN=function(zz)
c(Start = min(zz), End= max(zz), Run = length(zz) ))
)) #see note after this code block
#Output is in days since origin. Reconvert them into Date class
x2$Date.Start <- as.Date(x2$Date.Start, origin = "1970-01-01")
x2$Date.End <- as.Date(x2$Date.End, origin = "1970-01-01")
# SeriesNo Date.Start Date.End Date.Run
#1 0 2006-06-01 2006-06-03 3
#2 1 2006-06-06 2006-06-07 2
#3 2 2006-06-09 2006-06-09 1
A note on "buggy" output from aggregate
: Using aggregate to apply several functions on several variables in one call
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With