
First and Last value before NA

I am trying to obtain the first and last value for different segments before an NA value in a vector. Here is an example:

xx = seq(1, 122, by = 1)
xx[c(2:10, 14, 45:60, 120:121)] = NA

In turn, my results would be 1; 11 and 13; 15 and 44; 61 and 119; 122.

menzd53 asked Dec 14 '22 18:12


1 Answer

Doing the looping in a C++ function (via Rcpp) will be fast on a large vector.

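This function returns a 2-column matrix with one row per run of non-NA values: the first column gives the position where the run starts and the second column the position where it ends. Positions are 1-based indices into the vector; in the example they coincide with the values because xx holds 1 to 122. An NA in a row of the output below means the run touches that end of the vector.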

library(Rcpp)

cppFunction('NumericMatrix naSeq(NumericVector myVec) {

    int n = myVec.size();
    if (n == 0) return NumericMatrix(0, 2); // empty input: no runs to report

    NumericVector starts(n); // pre-allocate
    NumericVector ends(n);   // pre-allocate
    starts.fill(NumericVector::get_na());
    ends.fill(NumericVector::get_na());
    int startCounter = 0;
    int endCounter = 0;
    bool firstNumber = !NumericVector::is_na(myVec[0]); // does the vector open with a number?

    // a group is a run of consecutive non-NA values;
    // positions in the result are 1-based, as in R

    for (int i = 0; i < (n-1); i++) {
        // number followed by NA: a run ends at position i + 1
        if ( !NumericVector::is_na(myVec[i]) && NumericVector::is_na(myVec[i+1]) ) {
            if (endCounter == 0 && firstNumber) {
                // the run that opens the vector has no NA before it,
                // so its start stays NA; skip that slot to keep rows aligned
                startCounter++;
            }
            ends[endCounter] = i + 1;
            endCounter++;
        }

        // NA followed by number: a run starts at position i + 2
        if (NumericVector::is_na(myVec[i]) && !NumericVector::is_na(myVec[i+1]) ) {
            starts[startCounter] = i + 2;
            startCounter++;
        }
    }

    // a run that reaches the end of the vector has no recorded end, so its end stays NA
    int matSize = startCounter > endCounter ? startCounter : endCounter;
    NumericMatrix m(matSize, 2);
    for (int r = 0; r < matSize; r++) {
        m(r, 0) = starts[r];
        m(r, 1) = ends[r];
    }

    return m;

}')

naSeq(xx)

which gives

#      [,1] [,2]
# [1,]   NA    1
# [2,]   11   13
# [3,]   15   44
# [4,]   61  119
# [5,]  122   NA
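
As a cross-check that needs no compiler, the same run boundaries can be sketched in base R with rle(). This is only an illustrative sketch, not one of the benchmarked answers: run_bounds is a hypothetical helper name, and unlike naSeq it fills in the positions at the two ends of the vector instead of leaving them as NA. It also returns the first and last values of each run alongside the positions.

```r
# Hypothetical helper (not from the answer): locate runs of non-NA values with rle()
run_bounds <- function(x) {
    r <- rle(!is.na(x))              # alternating runs of non-NA (TRUE) and NA (FALSE)
    ends <- cumsum(r$lengths)        # 1-based position where each run ends
    starts <- ends - r$lengths + 1   # 1-based position where each run starts
    keep <- r$values                 # keep only the non-NA runs
    cbind(start = starts[keep], end = ends[keep],
          first = x[starts[keep]], last = x[ends[keep]])
}

# Example vector from the question
xx <- seq(1, 122, by = 1)
xx[c(2:10, 14, 45:60, 120:121)] <- NA
run_bounds(xx)
```

For the example vector this reports runs at positions 1 to 1, 11 to 13, 15 to 44, 61 to 119 and 122 to 122, which matches the result asked for in the question.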

Benchmarking

If you do care about speed, here's a quick benchmark of the solutions. Note that I'm taking the functions as-is from each answer, regardless of the format (or even content) of the result of each function.

library(microbenchmark)

set.seed(123)
xx <- seq_len(1e6)
naXX <- sample(xx, size = 1e5)
xx[naXX] <- NA 

mb <- microbenchmark(
    late = { latemail(xx) },
    sym = { naSeq(xx) },
    www = { www(xx) },
    mkr = { mkr(xx) },
    times = 5
)

print(mb, order = "median")

# Unit: milliseconds
# expr        min         lq       mean     median         uq        max neval
#  sym   22.66139   23.26898   27.18414   23.48402   27.85917   38.64716     5
#  www   45.11008   46.69587   55.73575   56.97421   61.63140   68.26719     5
#  mkr  369.69303  384.15262  427.35080  392.26770  469.59242  521.04821     5
# late 2417.21556 2420.25472 2560.41563 2627.19973 2665.19272 2672.21543     5

The other functions in the benchmark are taken from the other answers and defined as follows:

latemail <- function(xx) {
    nas <- is.na(xx)
    by(xx[!nas], cumsum(nas)[!nas], function(x) x[unique(c(1,length(x)))] )
}

www <- function(xx) {
    RLE <- rle(is.na(xx))
    L <- RLE$lengths
    Index <- cumsum(L[-length(L)]) + (1:(length(L) - 1) + 1) %% 2

    matrix(c(Index[1], NA, Index[2:length(Index)], NA), ncol = 2, byrow = TRUE)
}

library(dplyr)
mkr <- function(xx) {
    df <- data.frame(xx = xx)
    df %>% mutate(value = ifelse(is.na(xx),
                                 ifelse(!is.na(lag(xx)), lag(xx),
                                        ifelse(!is.na(lead(xx)), lead(xx), NA)),
                                 NA)) %>%
        select(value) %>%
        filter(!is.na(value))
}
SymbolixAU answered Dec 17 '22 22:12