
How to get frequency counts using column breaks by row?

Tags: r

I have a data frame which tracks service involvement (srvc_inv {1, 0}) for individual x (Bob) over a timeframe of interest (years 1900-1999).

library(tidyverse)

dat <- data.frame(name = rep("Bob", 100),
              day = seq(as.Date("1900/1/1"), as.Date("1999/1/1"), "years"),
              srvc_inv = c(rep(0, 25), rep(1, 25), rep(0, 25), rep(1, 25)))

As we can see, Bob has two service episodes: one episode between rows 26:50, and the other between rows 76:100.
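One quick way to confirm the two episodes is to run-length encode the column with base R's rle(). This is only a minimal sketch: it rebuilds the srvc_inv vector from the example so it runs on its own, and the name runs is purely illustrative.

```r
# Same 0/1 pattern as dat$srvc_inv in the example above
srvc_inv <- c(rep(0, 25), rep(1, 25), rep(0, 25), rep(1, 25))

# rle() collapses consecutive repeats into (length, value) pairs
runs <- rle(srvc_inv)
runs$lengths  # 25 25 25 25
runs$values   # 0 1 0 1
```

Each 1 in runs$values corresponds to one uninterrupted stretch of service involvement, which is why two episodes show up here.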

If we want to determine whether Bob had any service involvement during the timeframe, we can take a simple max() within each group, as shown below.

dat %>% 
  group_by(name) %>% 
  summarise(ever_inv = max(srvc_inv))

However, I would like to determine the number of service episodes that Bob had during the timeframe of interest (in this case, 2). A distinct service episode would be identified by a break in service involvement over consecutive dates. Anybody have any idea how to program this? Thanks!

DJC asked Aug 31 '19 16:08


3 Answers

One more solution, based on base R's rle():

library(dplyr)
dat %>% group_by(name) %>% 
        summarise(ever_inv = length(with(rle(srvc_inv), lengths[values==1])))

# A tibble: 1 x 2
name  ever_inv
  <fct>    <int>
1 Bob          2

A. Suliman answered Sep 20 '22 17:09


One possibility could be:

dat %>%
 group_by(name) %>%
 mutate(rleid = with(rle(srvc_inv), rep(seq_along(lengths), lengths))) %>%
 summarise(ever_inv = n_distinct(rleid[srvc_inv == 1]))

  name  ever_inv
  <fct>    <int>
1 Bob          2

tmfmnk answered Sep 20 '22 17:09


As an alternative to rle(), you can use diff():

dat %>%
  group_by(name) %>%
  summarise(ever_inv = sum(diff(c(0, srvc_inv)) > 0))

#   A tibble: 1 x 2
#   name  ever_inv
#   <fct>    <int>
# 1 Bob          2

Assuming that srvc_inv is either 0 or 1, diff(srvc_inv) equals 1 only where x[i] is 1 and x[i-1] is 0, i.e. at the start of a run of 1s; otherwise it is 0 or -1. I prepended a 0 to srvc_inv to cover the case where the series starts with a run of 1s.
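To see why the leading 0 matters, here is a small worked sketch on a made-up vector x (illustrative data, not from the question) that starts with a run of 1s. The names n_unpadded and n_padded are hypothetical.

```r
# Illustrative 0/1 vector that begins with a run of 1s
x <- c(1, 1, 0, 0, 1, 0, 1, 1)

# Without padding, the leading run has no 0 -> 1 step to detect
diff(x)                            # 0 -1 0 1 -1 1 0
n_unpadded <- sum(diff(x) > 0)     # 2, misses the first run

# Prepending 0 gives the leading run a detectable start
diff(c(0, x))                      # 1 0 -1 0 1 -1 1 0
n_padded <- sum(diff(c(0, x)) > 0) # 3, one per run of 1s
```

The vector contains three runs of 1s, so only the padded version counts them all.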

And with rle() there is, in my opinion, an even simpler solution:

dat %>%
  group_by(name) %>%
  summarise(ever_inv = sum(rle(srvc_inv)$values))

#   A tibble: 1 x 2
#   name  ever_inv
#   <fct>    <int>
# 1 Bob          2

Assuming that srvc_inv is either 0 or 1, it is enough to sum the values component of the rle object: each run contributes its value exactly once, so the sum is the number of runs of 1s.
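As a rough sanity check, the rle() and diff() approaches can be compared on a made-up vector. The helper names count_runs_rle and count_runs_diff are hypothetical, and the sketch assumes the input contains only 0s and 1s with no missing values.

```r
# Hypothetical helpers wrapping the two approaches from this answer
count_runs_rle  <- function(x) sum(rle(x)$values)
count_runs_diff <- function(x) sum(diff(c(0, x)) > 0)

# Illustrative 0/1 vector with three separate runs of 1s
x <- c(1, 1, 0, 0, 1, 0, 1, 1)

rle(x)$values       # 1 0 1 0 1
count_runs_rle(x)   # 3
count_runs_diff(x)  # 3
```

Both give the same count here; which one reads better inside summarise() is mostly a matter of taste.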

utubun answered Sep 22 '22 17:09