Imagine the following data frame:
# ID v1 v2 v3 v4
#1 H 0 0 d 0
#2 I 0 0 0 0
#3 J d 0 0 0
#4 K 0 0 0 d
#5 L 0 d 0 0
There is either one or no d per row.
For each row, I want to convert everything after d to NA. Desired result:
# ID v1 v2 v3 v4
#1 H 0 0 d NA
#2 I 0 0 0 0
#3 J d NA NA NA
#4 K 0 0 0 d
#5 L 0 d NA NA
DATA
df <- data.frame(ID = LETTERS[8:12],
v1 = c(0, 0, 'd', 0, 0),
v2 = c(0, 0, 0, 0, 'd'),
v3 = c('d', 0, 0, 0, 0),
v4 = c(0, 0, 0, 'd', 0),
stringsAsFactors = FALSE)
R has a built-in function, replace(), that replaces selected values in a vector with other values, for example zeros with NA.
replace(x, list, values) takes three arguments: the vector x, an index (list) selecting the elements to replace, and the replacement values. It returns a modified copy and leaves x itself unchanged.
To replace values in a data frame, use square bracket notation, df[...]; this lets you update values in a single column or in all columns. To refer to a single column, use df$column_name.
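As a minimal sketch of the two approaches just described (the vector x and the data frame d are hypothetical, made up only for illustration):

```r
# Hypothetical vector, used only for illustration
x <- c(0, 5, 0, 7)

# replace(x, list, values): `list` is an index, here a logical mask
y <- replace(x, x == 0, NA)
y
# [1] NA  5 NA  7

# replace() returns a modified copy; x itself is unchanged
x
# [1] 0 5 0 7

# The same idea on one column of a data frame, using bracket assignment
d <- data.frame(a = c(0, 1, 0), b = c(2, 0, 3))
d$a[d$a == 0] <- NA
```

Note that bracket assignment modifies d in place, whereas the result of replace() has to be assigned back to take effect.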
Using cummax:
ix = df == "d"
df[t(apply(ix, 1, cummax)) & !ix] = NA
# ID v1 v2 v3 v4
# 1 H 0 0 d <NA>
# 2 I 0 0 0 0
# 3 J d <NA> <NA> <NA>
# 4 K 0 0 0 d
# 5 L 0 d <NA> <NA>
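To see why this works, it may help to look at a single row of the logical mask. The sketch below uses a hypothetical vector ix_row standing in for one row of ix (the row L 0 d 0 0); it is an illustration of the idea, not part of the answer's code:

```r
# One row of the mask: TRUE marks the cell that holds "d"
ix_row <- c(FALSE, FALSE, TRUE, FALSE, FALSE)

# cummax() treats FALSE/TRUE as 0/1 and carries the first 1 forward
seen_d <- cummax(ix_row)
seen_d
# [1] 0 0 1 1 1

# Excluding the "d" cell itself leaves only the cells after it
after_d <- as.logical(seen_d) & !ix_row
after_d
# [1] FALSE FALSE FALSE  TRUE  TRUE
```

Applying this to every row with apply(ix, 1, cummax) returns the rows as columns, which is why the result is transposed with t() before it is combined with !ix.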
To increase speed, replace apply with collapse::dapply:
ix = df == "d"
df[collapse::dapply(ix, cummax, MARGIN = 1) & !ix] = NA
Or use matrixStats::rowCummaxs:
ix = df == "d"
df[matrixStats::rowCummaxs(ix) & !ix] = NA
For pre-0.62.0 matrixStats, see previous revision.
Two alternative solutions:
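# option 1
w <- which(df == "d", arr.ind = TRUE)            # (row, col) position of every "d"
w <- w[w[,2] < ncol(df), , drop = FALSE]         # ignore "d" in the last column; keep w a matrix
reps <- ncol(df) - w[,2]                         # number of cells after each "d"
w <- w[rep(seq_len(nrow(w)), reps), , drop = FALSE]
w[,2] <- w[,2] + sequence(reps)                  # shift to the columns after each "d"
df[w] <- NA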
# option 2
mc <- ncol(df) - max.col(df == "d", ties.method = "first")
mc[mc >= (ncol(df) - 1)] <- 0
rr <- rep(seq_along(mc), mc)
cc <- rep(ncol(df) - mc, mc) + unlist(sapply(mc, seq)[mc > 0])
df[cbind(rr, cc)] <- NA
which both also give the desired result.
My version for solving it is:
f1 <- function(x) {
  i1 <- which(x == 'd') + 1
  cond <- length(i1) > 0 && i1 <= length(x)
  if (cond) {
    x[i1:length(x)] <- NA
  }
  x
}
df[-1] <- t(apply(df[-1], 1, f1))
which gives,
# ID v1 v2 v3 v4
#1 H 0 0 d <NA>
#2 I 0 0 0 0
#3 J d <NA> <NA> <NA>
#4 K 0 0 0 d
#5 L 0 d <NA> <NA>
Here are two base R one-liners.
1) Reduce. Because this operates on entire columns at a time instead of row by row, it should be particularly fast if there are many rows and not many columns.
replace(df, TRUE, Reduce(function(x, y) ifelse(x == "d", NA, y), df, acc = TRUE))
giving:
ID v1 v2 v3 v4
1 H 0 0 d <NA>
2 I 0 0 0 0
3 J d <NA> <NA> <NA>
4 K 0 0 0 d
5 L 0 d <NA> <NA>
2) read.table. This assumes that the only occurrences of d are in cells consisting of a single d (which is the case for the example in the question).
replace(df, df!="d"&is.na(read.table(text=do.call(paste,df), comment="d", fill=NA)), NA)
giving:
ID v1 v2 v3 v4
1 H 0 0 d <NA>
2 I 0 0 0 0
3 J d <NA> <NA> <NA>
4 K 0 0 0 d
5 L 0 d <NA> <NA>
Another version using col and max.col:
df[-1][col(df[-1]) > max.col(df[-1] == "d", "last")] <- NA
df
# ID v1 v2 v3 v4
#1 H 0 0 d <NA>
#2 I 0 0 0 0
#3 J d <NA> <NA> <NA>
#4 K 0 0 0 d
#5 L 0 d <NA> <NA>
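The one-liner relies on two facts: max.col() with ties.method = "last" returns the last column when a row contains no TRUE at all, and col() gives each cell its own column number. A small sketch on a hypothetical two-row logical matrix m (one row with a "d" in column 3, one row without any):

```r
# Hypothetical mask: row 1 has its "d" in column 3, row 2 has none
m <- matrix(c(FALSE, FALSE, TRUE,  FALSE,
              FALSE, FALSE, FALSE, FALSE),
            nrow = 2, byrow = TRUE)

# Column of the "d" per row; a row of all-FALSE ties resolves to the last column
last_d <- max.col(m, ties.method = "last")
last_d
# [1] 3 4

# TRUE only for cells strictly to the right of the "d"
to_na <- col(m) > last_d
to_na
#       [,1]  [,2]  [,3]  [,4]
# [1,] FALSE FALSE FALSE  TRUE
# [2,] FALSE FALSE FALSE FALSE
```

Because a row without "d" maps to the last column, no cell in that row satisfies the comparison, so such rows are left untouched.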
An alternative with data.table:
library(data.table)
setDT(df)
df[, names(df)[-1] := {
       x <- unlist(.SD)
       if (any(x == "d")) {            # if there's no "d", no need to do anything
         whd <- which(x == "d")
         if (whd != length(x)) {       # if "d" is at the end, nothing to be done either
           x[(whd + 1):length(x)] <- NA
         }
       }
       as.list(x)                      # return the row as a list so the structure is kept
     },
   .SDcols = -1, by = 1:nrow(df)]      # you need to do a "by row" operation