I am trying to exclude rows that have missing values (NA) in all columns AND for which all subsequent rows also contain only missing values (or that are themselves the last all-empty row), i.e. I want to remove trailing "all-NA" rows.
I came up with the solution below, which works but is too slow (I am using this function on thousands of tables), probably because of the while loop.
library(data.table)
library(magrittr)

## Aux function to remove all-NA rows at the bottom of a table
remove_empty_row_last <- function(dt) {
  dt[, row_empty := rowSums(is.na(dt)) == ncol(dt)]
  while (dt[.N, row_empty] == TRUE) {
    dt <- dt[1:(.N - 1)]
  }
  dt %>% return()
}
d <- data.table(a = c(1,NA,3,NA,5,NA,NA), b = c(1,NA,3,4,5,NA,NA))
remove_empty_row_last(d)
# EDIT 2: adding more test cases
d2 <- data.table(A = c(1,NA,3,NA,5,1 ,NA), B = c(1,NA,3,4,5,NA,NA))
remove_empty_row_last(d2)
d3 <- data.table(A = c(1,NA,3,NA,5,NA,NA), B = c(1,NA,3,4,5,1,NA))
remove_empty_row_last(d3)
# EDIT 3: adding a test case with no all-NA rows
d4 <- data.table(A = c(1,2,3,NA,5,NA,NA), B = c(1,2,3,4,5,1,7))
d4 %>% remove_empty_row_last()
This seems to work with all test cases. The idea is to use a reverse cumsum to filter out the all-NA rows at the end (a step-by-step breakdown of the expression follows the examples below).
library(data.table)

remove_empty_row_last_new <- function(d) {
  d[d[, is.na(rev(cumsum(rev(ifelse(rowSums(!is.na(.SD)) == 0, 1, NA)))))]]
}
d <- data.table(a=c(1,NA,3,NA,5,NA,NA),b=c(1,NA,3,4,5,NA,NA))
remove_empty_row_last_new(d)
#> a b
#> 1: 1 1
#> 2: NA NA
#> 3: 3 3
#> 4: NA 4
#> 5: 5 5
d2 <- data.table(A=c(1,NA,3,NA,5,1 ,NA),B=c(1,NA,3,4,5,NA,NA))
remove_empty_row_last_new(d2)
#> A B
#> 1: 1 1
#> 2: NA NA
#> 3: 3 3
#> 4: NA 4
#> 5: 5 5
#> 6: 1 NA
d3 <- data.table(A=c(1,NA,3,NA,5,NA,NA),B=c(1,NA,3,4,5,1,NA))
remove_empty_row_last_new(d3)
#> A B
#> 1: 1 1
#> 2: NA NA
#> 3: 3 3
#> 4: NA 4
#> 5: 5 5
#> 6: NA 1
d4 <- data.table(A=c(1,2,3,NA,5,NA,NA),B=c(1,2,3,4,5,1,7))
remove_empty_row_last_new(d4)
#> A B
#> 1: 1 1
#> 2: 2 2
#> 3: 3 3
#> 4: NA 4
#> 5: 5 5
#> 6: NA 1
#> 7: NA 7
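For intuition, here is what the intermediate pieces of remove_empty_row_last_new() evaluate to for the first example d (a minimal breakdown; the name empty is only used here for illustration):
empty <- d[, ifelse(rowSums(!is.na(.SD)) == 0, 1, NA)]  # 1 marks an all-NA row, NA marks a row with data
empty
#> [1] NA  1 NA NA NA  1  1
rev(cumsum(rev(empty)))  # cumsum() propagates NA, so only the trailing all-NA block stays non-NA
#> [1] NA NA NA NA NA  2  1
is.na(rev(cumsum(rev(empty))))  # TRUE for the rows to keep
#> [1]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE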
You'll have to check performance on your real dataset, but it seems a bit faster:
> microbenchmark::microbenchmark(remove_empty_row_last(d),remove_empty_row_last_new(d))
Unit: microseconds
expr min lq mean median uq max neval cld
remove_empty_row_last(d) 384.701 411.800 468.5251 434.251 483.7515 1004.401 100 b
remove_empty_row_last_new(d) 345.201 359.301 416.1650 382.501 450.5010 1104.401 100 a
Maybe this will be fast enough?
d[!d[, any(rowSums(is.na(.SD)) == ncol(.SD)) & rleid(rowSums(is.na(.SD)) == ncol(.SD)) == max(rleid(rowSums(is.na(.SD)) == ncol(.SD)))]]
a b
1: 1 1
2: NA NA
3: 3 3
4: NA 4
5: 5 5
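For reference, here is what the two building blocks of that expression evaluate to on d (a small illustration; all_na is just a convenience name used here):
all_na <- d[, rowSums(is.na(.SD)) == ncol(.SD)]  # TRUE marks an all-NA row
all_na
#> [1] FALSE  TRUE FALSE FALSE FALSE  TRUE  TRUE
rleid(all_na)  # run-length ids: when the table ends with all-NA rows, they form the last run
#> [1] 1 2 3 3 3 4 4
In that case, the rows whose run id equals max(rleid(...)) are exactly the trailing all-NA rows, and the outer d[!...] drops them.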
Here's another approach that relies on Rcpp.
library(Rcpp)
library(data.table)
Rcpp::cppFunction("
IntegerVector which_end_cont(LogicalVector x) {
  const int n = x.size();
  // count the consecutive TRUE values at the end of the vector
  int consecutive = 0;
  for (int i = n - 1; i >= 0; i--) {
    if (x[i]) consecutive++; else break;
  }
  // return an empty vector if there is nothing to drop,
  // otherwise the indices of the rows to keep
  IntegerVector out(consecutive);
  if (consecutive == 0)
    return(out);
  else
    return(seq(1, n - consecutive));
}
")
remove_empty_row_last3 <- function(dt) {
  lgl <- rowSums(is.na(dt)) == length(dt)  # TRUE for all-NA rows
  ind <- which_end_cont(lgl)               # row indices to keep (empty if nothing to drop)
  if (length(ind)) return(dt[ind]) else return(dt)
}
Basically, it counts the trailing all-NA rows in compiled code and returns the indices of the rows to keep, so the data.table is subset only once. Using microbenchmark, this is about 3 times faster than the original function for cases in which there are empty rows at the end, and about 15 times faster when there are none.
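As a quick check on the question's d (for illustration; the result matches the other answers):
remove_empty_row_last3(d)
##     a  b
## 1:  1  1
## 2: NA NA
## 3:  3  3
## 4: NA  4
## 5:  5  5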
Edit
If you have taken the time to add Rcpp, the nice thing is that data.table has exported some of its internal functions so that they can be called directly from C. That can further simplify things and make it very, very quick, mainly because we can skip the NSE performed in [.data.table, which is why all conditions are now ~15 times faster than the OP's original function.
Rcpp::cppFunction("
SEXP mysub2(SEXP dt, LogicalVector x) {
  const int n = x.size();
  // count the consecutive TRUE values (all-NA rows) at the end
  int consecutive = 0;
  for (int i = n - 1; i >= 0; i--) {
    if (x[i]) consecutive++; else break;
  }
  if (consecutive == 0)
    return(dt);
  else
    // subset the data.table directly in C: rows 1..(n - consecutive), all columns
    return(DT_subsetDT(dt, wrap(seq(1, n - consecutive)), wrap(seq_len(LENGTH(dt)))));
}",
include = "#include <datatableAPI.h>",
depends = "data.table")
remove_empty_row_last4 <- function(dt) {
  lgl <- rowSums(is.na(dt)) == length(dt)  # TRUE for all-NA rows
  return(mysub2(dt, lgl))
}
dt = copy(d)
dt2 = copy(d2)
dt3 = copy(d3)
dt4 = copy(d4)
microbenchmark::microbenchmark(original = remove_empty_row_last(d3),
                               rcpp_subset = remove_empty_row_last4(dt3),
                               rcpp_ind_only = remove_empty_row_last3(dt3),
                               waldi = remove_empty_row_last_new(dt3),
                               ian = dt3[!dt3[, any(rowSums(is.na(.SD)) == ncol(.SD)) & rleid(rowSums(is.na(.SD)) == ncol(.SD)) == max(rleid(rowSums(is.na(.SD)) == ncol(.SD)))]])
## Unit: microseconds
## expr min lq mean median uq max neval
## original 498.0 519.00 539.602 537.65 551.85 621.6 100
## rcpp_subset 34.0 39.95 43.422 43.30 46.70 59.0 100
## rcpp_ind_only 116.9 129.75 139.943 140.15 146.35 177.7 100
## waldi 370.9 387.70 408.910 400.55 417.90 683.4 100
## ian 432.0 445.30 461.310 456.25 473.35 554.1 100
## andrew 120.0 131.40 143.153 141.60 151.65 197.5 100