Remove trailing (last) rows with NAs in all columns

Tags: r, missing-data, na, data.table, subset

I am trying to exclude rows that have missing values (NA) in all columns AND for which all subsequent rows also contain only missing values (or which are the last row), i.e. I want to remove trailing "all-NA" rows.

I came up with the solution below, which works but is too slow (I am using this function on thousands of tables), probably because of the while loop.

library(data.table)
library(magrittr)  # provides %>%

## Aux function to remove NA rows below table
remove_empty_row_last <- function(dt){
  dt[ , row_empty := rowSums(is.na(dt)) == ncol(dt)] 
  while (dt[.N, row_empty] == TRUE) {
    dt <- dt[1:(.N-1)]
    
  }
  dt %>% return()
}

d <- data.table(a = c(1,NA,3,NA,5,NA,NA), b = c(1,NA,3,4,5,NA,NA))
remove_empty_row_last(d)

#EDIT2: adding more test cases
d2 <- data.table(A = c(1,NA,3,NA,5,1 ,NA), B = c(1,NA,3,4,5,NA,NA))
remove_empty_row_last(d2)
d3 <- data.table(A = c(1,NA,3,NA,5,NA,NA), B = c(1,NA,3,4,5,1,NA))
remove_empty_row_last(d3)

#Edit3:adding no NA rows test case
d4 <- data.table(A = c(1,2,3,NA,5,NA,NA), B = c(1,2,3,4,5,1,7))
d4 %>% remove_empty_row_last()

asked Jan 12 '21 17:01

LucasMation

3 Answers

This seems to work with all test cases.
The idea is to use a reverse cumsum to filter out the NA rows at the end.

library(data.table)

remove_empty_row_last_new <- function(d) {
  d[d[,is.na(rev(cumsum(rev(ifelse(rowSums(!is.na(.SD))==0,1,NA)))))]]
}

d <- data.table(a=c(1,NA,3,NA,5,NA,NA),b=c(1,NA,3,4,5,NA,NA))
remove_empty_row_last_new(d)
#>     a  b
#> 1:  1  1
#> 2: NA NA
#> 3:  3  3
#> 4: NA  4
#> 5:  5  5

d2 <- data.table(A=c(1,NA,3,NA,5,1 ,NA),B=c(1,NA,3,4,5,NA,NA))
remove_empty_row_last_new(d2)
#>     A  B
#> 1:  1  1
#> 2: NA NA
#> 3:  3  3
#> 4: NA  4
#> 5:  5  5
#> 6:  1 NA

d3 <- data.table(A=c(1,NA,3,NA,5,NA,NA),B=c(1,NA,3,4,5,1,NA))
remove_empty_row_last_new(d3)
#>     A  B
#> 1:  1  1
#> 2: NA NA
#> 3:  3  3
#> 4: NA  4
#> 5:  5  5
#> 6: NA  1

d4 <- data.table(A=c(1,2,3,NA,5,NA,NA),B=c(1,2,3,4,5,1,7))
remove_empty_row_last_new(d4)
#>     A B
#> 1:  1 1
#> 2:  2 2
#> 3:  3 3
#> 4: NA 4
#> 5:  5 5
#> 6: NA 1
#> 7: NA 7
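
To make the one-liner easier to follow, here is a sketch that unpacks it step by step on the emptiness pattern of d (rows 2, 6 and 7 are all-NA). The variable names empty, flag, rev_cum and keep are only illustrative and are not part of the answer's code. The trick relies on cumsum() propagating NA: once the reversed vector reaches a non-empty row, every later entry becomes NA.

```r
# Emptiness pattern of `d` from above: TRUE marks an all-NA row.
empty <- c(FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE)

# 1 for empty rows, NA otherwise.
flag <- ifelse(empty, 1, NA)

# Walk from the last row upwards; cumsum() becomes NA (and stays NA)
# as soon as it meets the first non-empty row.
rev_cum <- cumsum(rev(flag))

# Back in the original order, NA marks the rows to keep.
keep <- is.na(rev(rev_cum))
print(keep)
#> [1]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE
```

Only the trailing run of empty rows gets a non-NA cumulative sum, so the empty row 2 in the middle of the table is kept.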

You'll have to check performance on your real dataset, but it seems a bit faster:

> microbenchmark::microbenchmark(remove_empty_row_last(d),remove_empty_row_last_new(d))
Unit: microseconds
                         expr     min      lq     mean  median       uq      max neval cld
     remove_empty_row_last(d) 384.701 411.800 468.5251 434.251 483.7515 1004.401   100   b
 remove_empty_row_last_new(d) 345.201 359.301 416.1650 382.501 450.5010 1104.401   100  a

answered Oct 19 '22 10:10

Waldi

Maybe this will be fast enough?

d[!d[,any(rowSums(is.na(.SD)) == ncol(.SD)) & rleid(rowSums(is.na(.SD)) == ncol(.SD)) == max(rleid(rowSums(is.na(.SD)) == ncol(.SD))),]]
    a  b
1:  1  1
2: NA NA
3:  3  3
4: NA  4
5:  5  5
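
For readers unfamiliar with rleid(), the sketch below unpacks the expression above. rleid() numbers consecutive runs of equal values, so the last run always has the highest id. One caveat that seems worth checking on your own data: as far as I can tell, the expression as written flags the last run whenever any row is empty, even if that last run is not itself empty. The variant below only flags rows that are both empty and part of the last run. The helper name drop_trailing_empty is hypothetical and not from the answer.

```r
library(data.table)

# Hypothetical helper: drop rows that are all-NA AND belong to the last run.
drop_trailing_empty <- function(dt) {
  empty <- rowSums(is.na(dt)) == ncol(dt)  # TRUE for all-NA rows
  run   <- rleid(empty)                    # run id of consecutive equal values
  keep  <- !(empty & run == max(run))      # keep everything except a trailing empty run
  dt[keep]
}

d <- data.table(a = c(1, NA, 3, NA, 5, NA, NA), b = c(1, NA, 3, 4, 5, NA, NA))
nrow(drop_trailing_empty(d))    # 5: the two trailing empty rows are dropped

# An empty row in the middle, but none at the end: nothing should be dropped.
mid <- data.table(a = c(1, NA, 3), b = c(1, NA, 3))
nrow(drop_trailing_empty(mid))  # 3
```

This sketch assumes the table has at least one row; a zero-row input is not handled.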

answered Oct 19 '22 09:10

Ian Campbell

Here's another approach that relies on Rcpp.

library(Rcpp)
library(data.table)

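Rcpp::cppFunction("
IntegerVector which_end_cont(LogicalVector x) {
  const int n = x.size();
  int consecutive = 0;

  // Count consecutive TRUE values, starting from the last element.
  for (int i = n - 1; i >= 0; i--) {
    if (x[i]) consecutive++; else break;
  }
  // No trailing TRUE values: return an empty vector so the caller can
  // hand back the input table untouched.
  IntegerVector out(consecutive);
  if (consecutive == 0)
    return(out);
  else
    // Otherwise return the 1-based indices of the rows to keep.
    return(seq(1, n - consecutive));
}
")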

remove_empty_row_last3 <- function(dt) {
  lgl = rowSums(is.na(dt)) == length(dt)
  ind = which_end_cont(lgl)
  if (length(ind)) return(dt[ind]) else return(dt)
}

Basically, it works in three steps:

R is used to find out which rows are completely NA.
Rcpp is used to loop backwards through the resulting logical vector and count how many consecutive empty rows there are at the end. Doing this in Rcpp minimizes the memory allocated.
If no rows are empty at the end, no index vector is allocated and the input is returned as is. Otherwise, the index sequence is built in Rcpp and returned to R to subset the data.table.
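
For comparison, the counting step can also be sketched in plain R without Rcpp. This is not the answer's implementation, only an illustration of the same logic that may help when checking the C++ version; the name count_trailing_true is made up for this example.

```r
# Count how many TRUE values sit at the end of a logical vector,
# scanning from the last element backwards like the C++ loop above.
count_trailing_true <- function(x) {
  consecutive <- 0L
  for (i in rev(seq_along(x))) {
    if (isTRUE(x[i])) consecutive <- consecutive + 1L else break
  }
  consecutive
}

lgl <- c(FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE)
n_trailing <- count_trailing_true(lgl)           # 2 trailing empty rows
keep_idx   <- seq_len(length(lgl) - n_trailing)  # indices of the rows to keep
print(keep_idx)
#> [1] 1 2 3 4 5
```

Using seq_len() here means a vector that is TRUE throughout yields an empty index rather than an error, which is one edge case to test separately in the Rcpp version.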

Using microbenchmark, this is about 3 times faster than the original function when there are empty rows at the end, and about 15 times faster when there are none.

Edit

If you have taken the time to add Rcpp, the nice thing is that data.table exports some of its internal functions so that they can be called directly from C. That can further simplify things and make it very quick, mainly because we can skip the NSE (non-standard evaluation) performed by [.data.table. That is why all cases are now ~15 times faster than the OP's original function.

Rcpp::cppFunction("
SEXP mysub2(SEXP dt, LogicalVector x) {
  const int n = x.size();
  int consecutive = 0;
  
  for (int i = n - 1; i >= 0; i--) {
    if (x[i]) consecutive++; else break;
  }
  if (consecutive == 0) 
    return(dt);
  else
    return(DT_subsetDT(dt, wrap(seq(1, n - consecutive)), wrap(seq_len(LENGTH(dt)))));
}",
                  include="#include <datatableAPI.h>",
                  depends="data.table")

remove_empty_row_last4 <- function(dt) {
  lgl = rowSums(is.na(dt)) == length(dt)
  return(mysub2(dt, lgl))
}

dt = copy(d)
dt2 = copy(d2)
dt3 = copy(d3)
dt4 = copy(d4)
microbenchmark::microbenchmark(original = remove_empty_row_last(d3),
                               rcpp_subset = remove_empty_row_last4(dt3), 
                               rcpp_ind_only = remove_empty_row_last3(dt3),
                               waldi = remove_empty_row_last_new(dt3),
                               ian = dt3[!dt3[,any(rowSums(is.na(.SD)) == ncol(.SD)) & rleid(rowSums(is.na(.SD)) == ncol(.SD)) == max(rleid(rowSums(is.na(.SD)) == ncol(.SD))),]])


## Unit: microseconds
##           expr   min     lq    mean median     uq   max neval
##       original 498.0 519.00 539.602 537.65 551.85 621.6   100
##    rcpp_subset  34.0  39.95  43.422  43.30  46.70  59.0   100
##  rcpp_ind_only 116.9 129.75 139.943 140.15 146.35 177.7   100
##          waldi 370.9 387.70 408.910 400.55 417.90 683.4   100
##            ian 432.0 445.30 461.310 456.25 473.35 554.1   100
##         andrew 120.0 131.40 143.153 141.60 151.65 197.5   100

answered Oct 19 '22 11:10

Cole
