There are many NAs in my dataset, and I need to shift the cells in each row to the left so that the non-NA values come first.
Example: my data frame:
df=data.frame(x=c("l","m",NA,NA,"p"),y=c(NA,"b","c",NA,NA),z=c("u",NA,"w","x","y"))
df
x y z
1 l <NA> u
2 m b <NA>
3 <NA> c w
4 <NA> <NA> x
5 p <NA> y
I want the above data frame converted into this:
x y z
1 l u NA
2 m b NA
3 c w NA
4 x <NA> NA
5 p y NA
Please help.
Thanks.
There have been a number of duplicate questions (here and here) since this one was asked. I have collected (and improved) some of the more idiomatic answers and benchmarked them against my own Rcpp implementation.
For simplicity, I have compared functions that take a character matrix as input and return one as output, not a data frame containing only character variables. You can always coerce from one to the other with as.matrix and as.data.frame (see bottom for an example).
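For the df in the question, that round trip is just the following (a minimal illustration; the full pattern appears at the bottom):
m <- as.matrix(df)         # character matrix, dimnames preserved
newdf <- as.data.frame(m)  # back to a data frame of character columns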
Rcpp::sourceCpp(code = '
#include <Rcpp.h>
using namespace Rcpp;

// Shift the non-NA entries of each row to the left, padding the rest of
// the row with NA.  The matrix is modified in place; x is stored
// column-major, so element (i, j) lives at index i + j * m.
// [[Rcpp::export]]
void shift_na_in_place(CharacterMatrix x)
{
    int m = x.nrow();
    int n = x.ncol();
    for (int i = 0, k = 0, k0 = 0; i < m; ++i) {
        // k walks across row i; k0 marks where the next non-NA value goes
        for (int j = 0; j < n; ++j) {
            if (x[k] != NA_STRING) {
                x[k0] = x[k];
                k0 += m;
            }
            k += m;
        }
        // pad the remainder of row i with NA
        while (k0 < k) {
            x[k0] = NA_STRING;
            k0 += m;
        }
        // advance to the start of row i + 1
        k = (k % m) + 1;
        k0 = k;
    }
    // column names are meaningless after the shift, so drop them
    if (x.attr("dimnames") != R_NilValue) {
        List dn = x.attr("dimnames");
        dn[1] = R_NilValue;
        if (dn.attr("names") != R_NilValue) {
            CharacterVector ndn = dn.attr("names");
            ndn[1] = "";
        }
    }
}

// [[Rcpp::export]]
CharacterMatrix shift_na(CharacterMatrix x)
{
    // work on a copy so that the original matrix is left untouched
    CharacterMatrix y = clone(x);
    shift_na_in_place(y);
    return y;
}
')
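For example, applied to the small data frame from the question (a quick sanity check, assuming the Rcpp code above has been compiled with sourceCpp; it is not part of the benchmark below):
x0 <- as.matrix(df)
shift_na(x0)           # returns a shifted copy; column names are dropped
shift_na_in_place(x0)  # shifts x0 itself, without allocating a copy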
# f1: apply over rows, moving the non-NA values to the front of each row
f1 <- function(x) {
    t(apply(x, 1L, function(y) {r <- is.na(y); c(y[!r], y[r])}))
}
# f2: apply over rows, using a stable radix sort on is.na() as the key
f2 <- function(x) {
    t(apply(x, 1L, function(y) y[order(is.na(y), method = "radix")]))
}
# f3: one global radix sort keyed on (row, is.na), refilled by row
f3 <- function(x) {
    d <- dim(x)
    dn <- dimnames(x)
    matrix(x[order(row(x), is.na(x), method = "radix")],
           nrow = d[1L], ncol = d[2L], byrow = TRUE,
           dimnames = if (!is.null(dn)) c(dn[1L], list(NULL)))
}
# f4: like f3, but with row and NA status combined into a single integer key
f4 <- function(x) {
    d <- dim(x)
    dn <- dimnames(x)
    matrix(x[order(is.na(x) + (row(x) - 1L) * 2L + 1L, method = "radix")],
           nrow = d[1L], ncol = d[2L], byrow = TRUE,
           dimnames = if (!is.null(dn)) c(dn[1L], list(NULL)))
}
set.seed(1L)
m <- 1e+05L
n <- 10L
x <- sample(c(letters, NA), size = m * n, replace = TRUE, prob = c(rep(1, 26), 13))
dim(x) <- c(m, n)
microbenchmark::microbenchmark(shift_na(x), f1(x), f2(x), f3(x), f4(x), check = "identical")
Unit: milliseconds
expr min lq mean median uq max neval
shift_na(x) 10.04959 10.32019 10.82935 10.41968 10.60104 22.69412 100
f1(x) 141.95959 150.83875 180.49025 167.01266 211.52478 248.07587 100
f2(x) 722.27211 759.75710 780.69368 773.26920 797.01253 857.07905 100
f3(x) 18.45201 19.15436 22.47760 21.59577 22.40543 66.47121 100
f4(x) 30.03168 31.62765 35.22960 33.92801 35.06384 85.92661 100
The dedicated Rcpp implementation shift_na is fastest, as you might expect, but f3 and f4 are not much slower. A few finer points:
f1 and f2 call apply, which is built on an R for loop, so it is not surprising that they are slow.
f3 and f4 have to allocate memory for is.na(x) and row(x), which could be a hindrance for large enough x.
f3 is faster than f4 because the "radix" sort uses a faster algorithm when the range (maximum minus minimum) of the integer vector being sorted is less than 100000 (see ?sort). Here, the ranges are:
is.na(x): 1
row(x): 99999
is.na(x) + (row(x) - 1L) * 2L + 1L: 199999
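These ranges can be checked directly on the benchmark data (a small sanity check, reusing the x generated above; the comments show the values quoted here):
k <- is.na(x)
r <- row(x)
diff(range(k))                       # 1
diff(range(r))                       # 99999
diff(range(k + (r - 1L) * 2L + 1L))  # 199999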
shift_na(x) creates a copy of x and modifies the copy in place. If you cannot or do not want to allocate memory for a copy because x is very large, then you can do shift_na_in_place(x) to modify x in place.
shift_na_in_place should be preferred over shift_na if you have a data frame data containing character variables, rather than a character matrix. In this situation, there is no need to preserve the intermediate as.matrix(data); it can be modified in place:
x <- as.matrix(data)
shift_na_in_place(x)
newdata <- as.data.frame(x)
You can use the standard apply function:
df=data.frame(x=c("l","m",NA,NA,"p"),y=c(NA,"b","c",NA,NA),z=c("u",NA,"w","x","y"))
df2 = as.data.frame(t(apply(df,1, function(x) { return(c(x[!is.na(x)],x[is.na(x)]) )} )))
colnames(df2) = colnames(df)
> df
x y z
1 l <NA> u
2 m b <NA>
3 <NA> c w
4 <NA> <NA> x
5 p <NA> y
> df2
x y z
1 l u <NA>
2 m b <NA>
3 c w <NA>
4 x <NA> <NA>
5 p y <NA>
Thanks to @Richard Scriven for the good observation.
A) With is.na and order, lapply and rbind for aggregation:
nosort.df<-do.call(rbind,lapply(1:nrow(df),function(x) { z=df[x,][order(is.na(df[x,]))];colnames(z)<-c("x","y","z");return(z) } ))
> nosort.df
x y z
1 l u <NA>
2 m b <NA>
3 c w <NA>
4 x <NA> <NA>
5 p y <NA>
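A slight variation on A) (just a sketch) that reuses names(df) instead of hard-coding the column names:
nosort.df <- do.call(rbind, lapply(seq_len(nrow(df)), function(i) {
    z <- df[i, ][order(is.na(df[i, ]))]
    setNames(z, names(df))
}))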
B) If sorted rows are required, with sort, lapply and rbind:
sort.df<-do.call(rbind,lapply(1:nrow(df),function(x) { z=sort(df[x,],na.last=TRUE);colnames(z)<-c("x","y","z");return(z) } ))
> sort.df
x y z
1 l u <NA>
2 b m <NA>
3 c w <NA>
4 x <NA> <NA>
5 p y <NA>
I have included a function for this task in my package dedupewider (available on CRAN). It allows moving NA to the right, left, or even to the top or bottom:
library(dedupewider)
df <- data.frame(x = c("l", "m", NA, NA, "p"),
                 y = c(NA, "b", "c", NA, NA),
                 z = c("u", NA, "w", "x", "y"))
na_move(df) # 'right' direction is by default
#> x y z
#> 1 l u NA
#> 2 m b NA
#> 3 c w NA
#> 4 x <NA> NA
#> 5 p y NA
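The other directions follow the same pattern (a sketch assuming the direction argument suggested by the comment above; see ?na_move for the exact interface):
na_move(df, direction = "left")    # push NA to the left, values to the right
na_move(df, direction = "bottom")  # shift within columns instead of rows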
It works by reshaping the data (from wide format to long and back to wide) and internally uses data.table functions, so it is significantly faster than the standard solution using apply:
library(dedupewider)
library(microbenchmark)
df <- data.frame(x = c("l", "m", NA, NA, "p"),
                 y = c(NA, "b", "c", NA, NA),
                 z = c("u", NA, "w", "x", "y"))
df <- do.call(rbind, replicate(10000, df, simplify = FALSE))
apply_function <- function(df) {
    as.data.frame(t(apply(df, 1, function(x) c(x[!is.na(x)], x[is.na(x)]))))
}
microbenchmark(apply_function(df), na_move(df))
#> Unit: milliseconds
#> expr min lq mean median uq max
#> apply_function(df) 289.2032 361.0178 475.65281 425.79355 545.6405 999.4086
#> na_move(df) 51.0419 58.1426 75.32407 65.01445 92.8706 216.6384