Consider this example data:
set.seed(1234567)
mydf <- data.frame(var1 = runif(10), var2 = c(runif(5), rep(NA, 5)))
And this example vectorized function which, unfortunately, throws an error whenever one of its arguments is NA:
myfn <- function(x, y){
  sum(x:y)
}
myfn <- Vectorize(myfn)
Now, in the middle of a dplyr chain, I need to create a new variable using myfn. This new variable (var3) is only defined when var1 and var2 are not NA.
The most common solution in similar situations is to use ifelse, something like this:
mydf %>%
  mutate(var3 = ifelse(
    test = is.na(var2),
    yes = NA,
    no = myfn(var1, var2)))
But this does not work in my case, because ifelse still passes the whole vectors var1 and var2 to myfn, not just the sub-vectors where test is FALSE. And it all breaks because myfn fails whenever it receives an NA.
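The behaviour described above can be checked with a small base-R sketch. The helper double_it and the counter seen are hypothetical names introduced only for this illustration; they record how many elements the function was handed.

```r
# Hypothetical helper: remembers the length of the vector it was given.
seen <- 0
double_it <- function(x) {
  seen <<- length(x)
  x * 2
}

test <- c(TRUE, FALSE, TRUE)

# ifelse() evaluates the whole `no` argument, then picks the
# positions where `test` is FALSE.
res <- ifelse(test, NA, double_it(1:3))

seen  # 3: the function received the full vector, not just one element
res   # NA 4 NA
```

So a function that errors on NA input will still error inside ifelse, even when every NA position is covered by the yes branch.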
So, what is the clever dplyr solution for this? (I can think of many solutions that do not use dplyr, but I am only interested in a dplyr-friendly one.)
It occurred to me that filter could help, and indeed it works, with very readable and dplyr-y code:
mydf %>%
  filter(!is.na(var2)) %>%
  mutate(var3 = myfn(var1, var2))
var1 var2 var3
1 0.56226084 0.62588794 0.56226084
2 0.72649850 0.24145251 0.72649850
3 0.91524985 0.03768974 0.91524985
4 0.02969437 0.51659297 0.02969437
5 0.76750970 0.81845788 0.76750970
But then I would have to save this in a temporary object, create var3 in the original data filled with NA, and put it all back together in the same data frame (because, as far as I know, the unfilter that some have suggested does not exist... yet).
So, just to illustrate the output I want, this code produces it (without using dplyr at all):
mydf$var3 <- NA
index <- !is.na(mydf$var2)
mydf$var3[index] <- myfn(mydf$var1[index], mydf$var2[index])
mydf
> mydf
var1 var2 var3
1 0.56226084 0.62588794 0.56226084
2 0.72649850 0.24145251 0.72649850
3 0.91524985 0.03768974 0.91524985
4 0.02969437 0.51659297 0.02969437
5 0.76750970 0.81845788 0.76750970
6 0.48005398 NA NA
7 0.08837960 NA NA
8 0.86294587 NA NA
9 0.49660306 NA NA
10 0.85350403 NA NA
EDIT:
I accepted @krlmlr's solution because it is what I was looking for: clear, easily readable and concise code that integrates effortlessly into a dplyr chain. For my example, this solution looks like this:
mydf %>%
  rowwise %>%
  mutate(var3 = if(is.na(var2)) NA else myfn(var1, var2))
However, as @krlmlr pointed out in his answer, operating row by row has a performance cost. It may not be significant for small data sets or one-off operations, but for larger data sets, or when the operation is repeated millions of times, it could be considerable. To illustrate, here's a comparison using microbenchmark and three solutions (base, dplyr and data.table) applied to a somewhat larger data set (nothing massive, just 1000 rows instead of the 10 in my original example).
library(data.table)
library(dplyr)
set.seed(1234567)
mydf <- data.frame(var1 = runif(1000), var2 = c(runif(500), rep(NA, 500)))
myfn <- function(x, y){
  sum(x:y)
}
myfn <- Vectorize(myfn)
using_base <- function(){
  mydf$var3 <- NA
  index <- !is.na(mydf$var2)
  mydf$var3[index] <- myfn(mydf$var1[index], mydf$var2[index])
}
using_dplyr <- function(){
  mydf <- mydf %>%
    rowwise %>%
    mutate(var3 = if(is.na(var2)) NA else myfn(var1, var2))
}
using_datatable <- function(){
  setDT(mydf)[!is.na(var2), var3 := myfn(var1, var2)]
}
library(microbenchmark)
mbm <- microbenchmark(
using_base(), using_dplyr(), using_datatable(),
times = 1000)
library(ggplot2)
autoplot(mbm)
And as you can see, the dplyr solution using rowwise is considerably slower than its base and data.table rivals.
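For completeness, a variant that avoids rowwise and still fits in a dplyr chain can be sketched as follows. This is only a sketch under stated assumptions: the helper column ok is a hypothetical name introduced here, and the code assumes at least one row has both inputs present (with no complete rows, the Vectorize wrapper returns an empty list and the assignment would not behave as intended). It was not part of the benchmark above, so no timing is claimed for it.

```r
library(dplyr)

set.seed(1234567)
mydf <- data.frame(var1 = runif(10), var2 = c(runif(5), rep(NA, 5)))
myfn <- Vectorize(function(x, y) sum(x:y))

result <- mydf %>%
  mutate(
    # Hypothetical helper column: TRUE where both inputs are present.
    ok = !is.na(var1) & !is.na(var2),
    # Start from an all-NA column and fill only the complete rows,
    # so myfn never sees an NA.
    var3 = replace(rep(NA_real_, n()), ok, myfn(var1[ok], var2[ok]))
  ) %>%
  select(-ok)

result
```

This mirrors the base-R index assignment shown earlier, with the subsetting done inside mutate instead of on a temporary object.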
You could perhaps consider using data.table, given that dplyr currently does not support in-place mutation, which is what you seem to be looking for.
library(data.table)
setDT(mydf)[!is.na(var2), var3 := myfn(var1, var2)]
# var1 var2 var3
# 1: 0.56226084 0.62588794 0.56226084
# 2: 0.72649850 0.24145251 0.72649850
# 3: 0.91524985 0.03768974 0.91524985
# 4: 0.02969437 0.51659297 0.02969437
# 5: 0.76750970 0.81845788 0.76750970
# 6: 0.48005398 NA NA
# 7: 0.08837960 NA NA
# 8: 0.86294587 NA NA
# 9: 0.49660306 NA NA
#10: 0.85350403 NA NA
If your original function isn't vectorized and cannot cope with certain inputs, there's no performance benefit in vectorizing it using Vectorize(). Instead, use dplyr::rowwise() to operate row by row:
iris %>%
  rowwise %>%
  mutate(x = if (Sepal.Length < 5) 1 else NA) %>%
  ungroup
Note that using if here is perfectly safe, as the input has length 1.
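On the point about Vectorize(): it is a wrapper around mapply(), so the scalar function is still called once per element. A minimal base-R sketch of that, where myfn_scalar, the counter calls and the input vectors are illustrative names that do not come from the question:

```r
# Illustrative scalar function that counts how often it is called.
calls <- 0
myfn_scalar <- function(x, y) {
  calls <<- calls + 1
  sum(x:y)
}
myfn_vec <- Vectorize(myfn_scalar)

x <- c(1, 2, 3)
y <- c(3, 4, 5)

via_vectorize <- myfn_vec(x, y)
calls_after_vectorize <- calls           # one call per element

via_mapply <- mapply(myfn_scalar, x, y)  # the same loop, written explicitly

via_vectorize                            # 6 9 12
identical(via_vectorize, via_mapply)     # TRUE
```

The wrapper changes how the function is called, not how much work is done, which is why looping row by row costs about the same per element.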