Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Remove rows with NAs only for all select columns at start and end of dataset in R

Tags:

r

na

dplyr

Given the dataset below...

col1  col2  var1  var2  var3
   1     1    NA    NA    NA
   2     2    NA    NA    NA
   3     3    NA     3    NA
   4     4     4     4     4
   5     5     5     5     5
   6     6     6    NA     6
   7     7    NA     7     7
   8     8    NA    NA    NA
   9     9     9    NA    NA
  10    10    NA    NA    NA

...how can I remove rows containing NAs for all select columns only at the start and end in an automated way (and preferably using tidyverse)?

col1  col2  var1  var2  var3
   3     3    NA     3    NA
   4     4     4     4     4
   5     5     5     5     5
   6     6     6    NA     6
   7     7    NA     7     7
   8     8    NA    NA    NA
   9     9     9    NA    NA

So in the example above, how can I remove those rows containing NAs for all of columns var1-var3 only at the start and end of the dataset? That is, only rows 1, 2 and 10 should disappear. (Although row 8 contains NAs for all of these select variables, it should not be removed as it is not at the 'start' or 'end' of the dataset.)

Here is my tidyverse solution:

library(tidyverse)
have <- tibble(col1 = 1:10, col2 = 1:10, var1 = 1:10, var2 = 1:10, var3 = 1:10)
have[c(1, 2, 10), 3:5] <- NA
have[3, c(3, 5)] <- NA
have[6, 4] <- NA
have[7, 3] <- NA
have[8, 3:5] <- NA
have[9, 4:5] <- NA

no_select_vars <- 3 # The number of select variables
want <- have |>
  mutate(no_missing = rowSums(across(-c(col1, col2), ~ is.na(.x)))) |>
  slice(first(which(no_missing < no_select_vars)):n()) |>
  slice(1:last(which(no_missing < no_select_vars))) |>
  select(-no_missing)

Is there an existing function for this purpose, or a more elegant solution than mine?

like image 667
tumbling_tower Avatar asked Oct 29 '25 09:10

tumbling_tower


2 Answers

One approach is to create an indicator if all of the selected columns are missing (in this case, no_miss if not all are missing).

Then, you can filter with cumany removing all rows until a TRUE is reached in no_miss at beginning, and then same logic in reverse at the end.

library(tidyverse)

have %>%
  mutate(no_miss = !if_all(contains("var"), is.na)) %>%
  filter(cumany(no_miss) & rev(cumany(rev(no_miss))))

Output

   col1  col2  var1  var2  var3 no_miss
  <int> <int> <int> <int> <int> <lgl>  
1     3     3    NA     3    NA TRUE   
2     4     4     4     4     4 TRUE   
3     5     5     5     5     5 TRUE   
4     6     6     6    NA     6 TRUE   
5     7     7    NA     7     7 TRUE   
6     8     8    NA    NA    NA FALSE  
7     9     9     9    NA    NA TRUE 
like image 52
Ben Avatar answered Oct 31 '25 22:10

Ben


Here's another way, using base R:

# Find rows where every column we're looking at is NA
all_na_row <- rowSums(is.na(have[,grep( "var", names(have))])) == sum(grepl("var", names(have)))

# Check if the cumsum of all NA rows equals the row number
# Do the same for the reverse, and take the minimum of each to always check the closest edge
have[pmin(cumsum(all_na_row), rev(cumsum(rev(all_na_row)))) != pmin(seq_along(all_na_row), rev(seq_along(all_na_row))),]

Output:

# A tibble: 7 × 5
   col1  col2  var1  var2  var3
  <int> <int> <int> <int> <int>
1     3     3    NA     3    NA
2     4     4     4     4     4
3     5     5     5     5     5
4     6     6     6    NA     6
5     7     7    NA     7     7
6     8     8    NA    NA    NA
7     9     9     9    NA    NA
like image 40
Mark Avatar answered Oct 31 '25 23:10

Mark