Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Shifting non-NA cells to the left

There are many NA's in my dataset and I need to shift all those cells (at row level) to the left.

Example- my dataframe:

    df=data.frame(x=c("l","m",NA,NA,"p"),y=c(NA,"b","c",NA,NA),z=c("u",NA,"w","x","y"))
    df
         x    y    z
    1    l <NA>    u
    2    m    b <NA>
    3 <NA>    c    w
    4 <NA> <NA>    x
    5    p <NA>    y

I want the above dataframe converted into this:

      x    y  z
    1 l    u NA
    2 m    b NA
    3 c    w NA
    4 x <NA> NA
    5 p    y NA

Please help.

Thanks.

like image 760
sidpat Avatar asked Apr 25 '14 05:04

sidpat


4 Answers

There have been a number of duplicate questions (here and here) since this one was asked. I have collected (and improved) some of the more idiomatic answers and benchmarked them against my own Rcpp implementation.

For simplicity, I have compared functions that take as input and return as output a character matrix, not a data frame containing only character variables. You can always coerce from one to the other with as.matrix and as.data.frame (see bottom for example).

Rcpp::sourceCpp(code = '
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
void shift_na_in_place(CharacterMatrix x)
{
  int m = x.nrow();
  int n = x.ncol();
  for (int i = 0, k = 0, k0 = 0; i < m; ++i) {
    for (int j = 0; j < n; ++j) {
      if (x[k] != NA_STRING) {
        x[k0] = x[k];
        k0 += m;
      }
      k += m;
    }
    while (k0 < k) {
      x[k0] = NA_STRING;
      k0 += m;
    }
    k = (k % m) + 1;
    k0 = k;
  }
  if (x.attr("dimnames") != R_NilValue) {
    List dn = x.attr("dimnames");
    dn[1] = R_NilValue;
    if (dn.attr("names") != R_NilValue) {
      CharacterVector ndn = dn.attr("names");
      ndn[1] = "";
    }
  }
}

// [[Rcpp::export]]
CharacterMatrix shift_na(CharacterMatrix x)
{
  CharacterMatrix y = clone(x);
  shift_na_in_place(y);
  return y;
}
')
f1 <- function(x) {
  t(apply(x, 1L, function(y) {r <- is.na(y); c(y[!r], y[r])}))
}
f2 <- function(x) {
  t(apply(x, 1L, function(y) y[order(is.na(y), method = "radix")]))
}
f3 <- function(x) {
  d <- dim(x)
  dn <- dimnames(x)
  matrix(x[order(row(x), is.na(x), method = "radix")],
         nrow = d[1L], ncol = d[2L], byrow = TRUE,
         dimnames = if (!is.null(dn)) c(dn[1L], list(NULL)))
}
f4 <- function(x) {
  d <- dim(x)
  dn <- dimnames(x)
  matrix(x[order(is.na(x) + (row(x) - 1L) * 2L + 1L, method = "radix")],
         nrow = d[1L], ncol = d[2L], byrow = TRUE,
         dimnames = if (!is.null(dn)) c(dn[1L], list(NULL)))
}
set.seed(1L)
m <- 1e+05L
n <- 10L
x <- sample(c(letters, NA), size = m * n, replace = TRUE, prob = c(rep(1, 26), 13))
dim(x) <- c(m, n)
microbenchmark::microbenchmark(shift_na(x), f1(x), f2(x), f3(x), f4(x), check = "identical")
Unit: milliseconds
        expr       min        lq      mean    median        uq       max neval
 shift_na(x)  10.04959  10.32019  10.82935  10.41968  10.60104  22.69412   100
       f1(x) 141.95959 150.83875 180.49025 167.01266 211.52478 248.07587   100
       f2(x) 722.27211 759.75710 780.69368 773.26920 797.01253 857.07905   100
       f3(x)  18.45201  19.15436  22.47760  21.59577  22.40543  66.47121   100
       f4(x)  30.03168  31.62765  35.22960  33.92801  35.06384  85.92661   100

The dedicated Rcpp implementation shift_na is fastest, as you might expect, but f3 and f4 are not much slower. A few finer points:

  • f1 and f2 call apply, which is built on an R for loop, so it is not surprising that they are slow.

  • f3 and f4 have to allocate memory for is.na(x) and row(x), which could be a hindrance for large enough x.

  • f3 is faster than f4 because the "radix" sort uses a faster algorithm when the range (maximum minus minimum) of the integer vector being sorted is less than 100000 (see ?sort). Here, the ranges are:

                              is.na(x):      1
                                row(x):  99999
    is.na(x) + (row(x) - 1L) * 2L + 1L: 199999
    
  • shift_na(x) creates a copy of x and modifies the copy in place. If you cannot or do not want to allocate memory for a copy because x is very large, then you can do shift_na_in_place(x) to modify x in place.

  • shift_na_in_place should be preferred over shift_na if you have a data frame data containing character variables, rather than a character matrix. In this situation, there is no need to preserve the intermediate as.matrix(data); it can be modified in place:

    x <- as.matrix(data)
    shift_na_in_place(x)
    newdata <- as.data.frame(x)
    
like image 125
Mikael Jagan Avatar answered Sep 18 '22 18:09

Mikael Jagan


You can use the standard apply function:

df=data.frame(x=c("l","m",NA,NA,"p"),y=c(NA,"b","c",NA,NA),z=c("u",NA,"w","x","y"))
df2 = as.data.frame(t(apply(df,1, function(x) { return(c(x[!is.na(x)],x[is.na(x)]) )} )))
colnames(df2) = colnames(df)

> df
     x    y    z
1    l <NA>    u
2    m    b <NA>
3 <NA>    c    w
4 <NA> <NA>    x
5    p <NA>    y
> df2
  x    y    z
1 l    u <NA>
2 m    b <NA>
3 c    w <NA>
4 x <NA> <NA>
5 p    y <NA>
like image 22
Hans Roggeman Avatar answered Nov 02 '22 15:11

Hans Roggeman


Thanks to @Richard Scriven for good observation

A) with is.na and order, lapply and rbind for aggregation

nosort.df<-do.call(rbind,lapply(1:nrow(df),function(x) { z=df[x,][order(is.na(df[x,]))];colnames(z)<-c("x","y","z");return(z) } ))

> nosort.df
  x    y    z
1 l    u <NA>
2 m    b <NA>
3 c    w <NA>
4 x <NA> <NA>
5 p    y <NA>

B) if sorted rows are required:

with sort, lapply and rbind

sort.df<-do.call(rbind,lapply(1:nrow(df),function(x) { z=sort(df[x,],na.last=TRUE);colnames(z)<-c("x","y","z");return(z) } ))

> sort.df
  x    y    z
1 l    u <NA>
2 b    m <NA>
3 c    w <NA>
4 x <NA> <NA>
5 p    y <NA> 
like image 6
Silence Dogood Avatar answered Nov 02 '22 16:11

Silence Dogood


I have included a function for this task in my package dedupewider (available on CRAN). It allows to move NA to right, left or even top and bottom:

library(dedupewider)

df <- data.frame(x = c("l", "m", NA, NA, "p"),
                 y = c(NA, "b", "c", NA, NA),
                 z = c("u", NA, "w", "x", "y"))

na_move(df) # 'right' direction is by default

#>   x    y  z
#> 1 l    u NA
#> 2 m    b NA
#> 3 c    w NA
#> 4 x <NA> NA
#> 5 p    y NA

It implements the solution of reshaping data (from wide format to long and again to wide) and internally uses data.table functions. Thus it is significantly faster than standard solution using apply:

library(dedupewider)
library(microbenchmark)

df <- data.frame(x = c("l", "m", NA, NA, "p"),
                 y = c(NA, "b", "c", NA, NA),
                 z = c("u", NA, "w", "x", "y"))

df <- do.call(rbind, replicate(10000, df, simplify = FALSE))

apply_function <- function(df) {
  as.data.frame(t(apply(df, 1, function(x) c(x[!is.na(x)], x[is.na(x)]))))
}

microbenchmark(apply_function(df), na_move(df))

#> Unit: milliseconds
#>                expr      min       lq      mean    median       uq      max
#>  apply_function(df) 289.2032 361.0178 475.65281 425.79355 545.6405 999.4086
#>         na_move(df)  51.0419  58.1426  75.32407  65.01445  92.8706 216.6384
like image 4
gss Avatar answered Nov 02 '22 16:11

gss