
Count a sequence to include NA values

Tags: r, sequence

Here is a sample data frame that resembles a larger data set:

Day <- c(1, 2, NA, 3, 4, NA, NA, NA, NA, NA, 1, 2, 3, NA, NA, NA, NA, 1, 2, NA, NA, 3, 4, 5)
y   <- rpois(length(Day), 2)
z   <- seq_along(Day) + 500
df  <- data.frame(z, Day, y)

If there is a sequence of 4 or more missing values (NAs) in the Day column, that sequence represents a gap between cohorts in the study. If there are fewer than 4 NAs in a sequence, the missing values are still considered part of the cohort (e.g. row 3 is part of cohort 1, but row 8 is not). In the sample data frame, there are 3 cohorts (Cohort 1: rows 1-5, Cohort 2: rows 11-13, and Cohort 3: rows 18-24). I would like to add one column listing the cohort number and another listing the cohort study day. Here is the code I used:

library(dplyr)
CheckNA        <- rle(is.na(df$Day))
CheckNA$values <- CheckNA$lengths >= 4 & CheckNA$values == 1
ListNA         <- rep(CheckNA$values, CheckNA$lengths)
df$Co          <- rep(c(1, NA, 2, NA, 3), rle(ListNA)$lengths) %>% as.factor()

df <- df %>% 
  group_by (Co) %>% 
  mutate(CoDay = seq(Co)) %>% 
  as.data.frame()

df$CoDay <- ifelse(is.na(df$Co), NA, df$CoDay)

Is there a more efficient way to accomplish this task? I'm specifically looking for code that avoids having to list the cohort numbers by hand, since my actual data set will have over 10 cohorts. I currently just list the sequence that should be repeated: c(1, NA, 2, NA, 3).

Tania Alarcon asked Apr 06 '17 20:04




1 Answer

I'd make a change here:

CheckNA        <- rle(is.na(df$Day))
CheckNA$values <- CheckNA$lengths >= 4 & CheckNA$values == 1
CheckNA$values <- ifelse(!CheckNA$values, cumsum(CheckNA$values)+1, NA)
df$Co <- inverse.rle(CheckNA)

I kept the first two lines the same, then I used cumsum() to assign new IDs at each break. This means you won't have to hard-code any values. With the new values, you can use inverse.rle much in the same way you used rep() to expand the new ID out to each of the rows.
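To make the cumsum() step concrete, here is a small walk-through of the intermediate values on the sample Day vector from the question. It is only an illustrative sketch: the names runs, is_break and run_id are introduced here and are not part of the code above.

```r
# Sample Day vector from the question
Day <- c(1, 2, NA, 3, 4, NA, NA, NA, NA, NA, 1, 2, 3, NA, NA, NA, NA, 1, 2, NA, NA, 3, 4, 5)

# One entry per run of consecutive NA / non-NA values
runs <- rle(is.na(Day))
# runs$lengths: 2 1 2 5 3 4 2 2 3

# TRUE only for NA runs of length 4 or more (the gaps between cohorts)
is_break <- runs$values & runs$lengths >= 4
# FALSE FALSE FALSE TRUE FALSE TRUE FALSE FALSE FALSE

# Every gap bumps the running count, so the runs after it get the next ID
run_id <- ifelse(is_break, NA, cumsum(is_break) + 1)
# 1 1 1 NA 2 NA 3 3 3

# Expand the per-run IDs back to one value per row
runs$values <- run_id
Co <- inverse.rle(runs)
```

Rows 1-5 end up with ID 1, rows 11-13 with ID 2 and rows 18-24 with ID 3, which matches the cohorts described in the question.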

If you turn that into a function, you can clean up the dplyr bits:

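id_NA_break <- function(x) {
  # Run-length encode the NA pattern: one entry per run of NA / non-NA values
  CheckNA        <- rle(is.na(x))
  # TRUE only for NA runs of length 4 or more, i.e. the gaps between cohorts
  CheckNA$values <- CheckNA$lengths >= 4 & CheckNA$values == 1
  # Non-gap runs get (number of gaps so far) + 1 as their ID; gap runs get NA
  CheckNA$values <- ifelse(!CheckNA$values, cumsum(CheckNA$values)+1, NA)
  # Expand the per-run IDs back to one value per row
  inverse.rle(CheckNA)
}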
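One edge case may be worth checking on real data: because the ID is "number of gaps so far, plus one", a data set that begins with a gap of 4 or more NAs would number its first cohort 2 rather than 1. If that matters, a possible variant is to count cohort starts instead of gaps. The sketch below is only an assumption about how one might do that; the function name id_NA_break_from_one and the min_gap argument are hypothetical.

```r
# Hypothetical variant of id_NA_break(); the name and the choice to count
# cohort starts are assumptions for illustration
id_NA_break_from_one <- function(x, min_gap = 4) {
  runs     <- rle(is.na(x))
  is_break <- runs$values & runs$lengths >= min_gap
  # A cohort starts at the first run (if it is not a gap) or right after a gap
  starts   <- !is_break & c(TRUE, head(is_break, -1))
  runs$values <- ifelse(is_break, NA, cumsum(starts))
  inverse.rle(runs)
}

# Leading gap of four NAs, then one cohort containing a single short NA run
id_NA_break_from_one(c(NA, NA, NA, NA, 1, 2, NA, 3))
# NA NA NA NA 1 1 1 1
```

On the sample Day vector from the question, this variant gives the same IDs as the gap-counting approach.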

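df  <- data.frame(z, Day, y)
df %>%
  mutate(Co = id_NA_break(Day)) %>%
  group_by(Co) %>%
  # seq_along() rather than seq(): seq() on a one-row group would expand to 1:Co
  mutate(CoDay = ifelse(is.na(Co), NA, seq_along(Co))) %>%
  ungroup()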
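When numbering rows within a group, seq_along() is generally a safer choice than seq(): seq() called on a single number n returns 1:n rather than a single index, which matters if a cohort ever has only one row. The sketch below illustrates that difference and shows one possible base R way to build the cohort-day column without dplyr. The Co vector is typed in by hand to match the sample data, and Co_single is a hypothetical one-row cohort.

```r
# Hypothetical cohort with a single row and ID 3
Co_single <- 3
seq(Co_single)        # 1 2 3: three values for one row
seq_along(Co_single)  # 1: one value per element

# Cohort IDs for the sample data (rows 1-5, 11-13 and 18-24), typed in by hand
Co <- c(rep(1, 5), rep(NA, 5), rep(2, 3), rep(NA, 4), rep(3, 7))

# Number the rows within each cohort; gap rows stay NA
CoDay <- rep(NA_integer_, length(Co))
keep  <- !is.na(Co)
CoDay[keep] <- ave(seq_along(Co)[keep], Co[keep], FUN = seq_along)
```

Here CoDay runs 1 to 5 for the first cohort, 1 to 3 for the second and 1 to 7 for the third, with NA on the gap rows.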
MrFlick answered Sep 29 '22 09:09
