
create new variable defined only for a subset of the data using `dplyr`

Tags: r, dplyr

Consider this example data:

set.seed(1234567)
mydf <- data.frame(var1 = runif(10), var2 = c(runif(5), rep(NA, 5)))

And consider this example vectorized function which, unfortunately, throws an error whenever one of its arguments is NA:

myfn <- function(x, y){
    sum(x:y)
}
myfn <- Vectorize(myfn)

Now, in the middle of a dplyr chain I need to create a new variable using myfn. This new var (var3) is only defined when var1 and var2 are not NA.

So the most common solution for similar situations is to use ifelse. Something like this.

mydf %>%
    mutate(var3 = ifelse(
        test = is.na(var2), 
        yes = NA, 
        no = myfn(var1, var2)))

But this does not work in my case, because ifelse evaluates myfn on the whole var1 and var2 vectors, not just on the elements where test is FALSE. Everything breaks because myfn fails whenever it receives an NA.
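To make the ifelse behaviour concrete, here is a minimal sketch. The function `strict_fn` is a hypothetical stand-in for any function that fails on NA input; it is not part of the original example. The `no` argument is computed on the full vector before ifelse picks elements from it, so the error surfaces even though the NA position would never be used.

```r
# Hypothetical stand-in for a function that cannot cope with NA input
strict_fn <- function(x) {
    if (anyNA(x)) stop("NA received")
    x * 2
}

v <- c(1, NA, 3)

# ifelse() evaluates strict_fn(v) on the whole vector, NA included,
# so the error is raised before any element is selected
result <- tryCatch(
    ifelse(is.na(v), NA, strict_fn(v)),
    error = function(e) conditionMessage(e)
)
result
```

Here `result` ends up holding the error message rather than a numeric vector, which is the same failure mode described above for myfn.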

So, what is the clever dplyr solution for this? (I can think of many solutions for this without using dplyr, but I am just interested in a dplyr-friendly solution)

It occurred to me that filter could help, and indeed it works with very readable, dplyr-style code:

mydf %>%
    filter(!is.na(var2)) %>%
    mutate(var3 = myfn(var1, var2))

        var1       var2       var3
1 0.56226084 0.62588794 0.56226084
2 0.72649850 0.24145251 0.72649850
3 0.91524985 0.03768974 0.91524985
4 0.02969437 0.51659297 0.02969437
5 0.76750970 0.81845788 0.76750970

But then I would have to save this in a temporary object, create var3 in the original data filled with NA, and put it all back together (because, as far as I know, the unfilter that some have suggested does not exist yet).

So just to illustrate the output I want, this code produces it (without using dplyr at all):

mydf$var3 <- NA
index <- !is.na(mydf$var2)
mydf$var3[index] <- myfn(mydf$var1[index], mydf$var2[index])
mydf

> mydf
         var1       var2       var3
1  0.56226084 0.62588794 0.56226084
2  0.72649850 0.24145251 0.72649850
3  0.91524985 0.03768974 0.91524985
4  0.02969437 0.51659297 0.02969437
5  0.76750970 0.81845788 0.76750970
6  0.48005398         NA         NA
7  0.08837960         NA         NA
8  0.86294587         NA         NA
9  0.49660306         NA         NA
10 0.85350403         NA         NA

EDIT:

I accepted @krlmlr's solution because it is what I was looking for: clear, easily readable, and concise code that integrates effortlessly into a dplyr chain. For my example, the solution looks like this:

mydf %>%
        rowwise %>%
        mutate(var3 = if(is.na(var2)) NA else myfn(var1, var2))

However, as @krlmlr pointed out in his answer, operating row by row has a performance cost. It may not be significant for small data sets or one-off operations, but for larger data sets, or when the operation is repeated millions of times, it could be considerable. To illustrate, here is a comparison using microbenchmark of three solutions (base, dplyr and data.table) applied to a somewhat larger data set (nothing massive, just 1000 rows instead of the 10 in my original example).

library(data.table)
library(dplyr)

set.seed(1234567)
mydf <- data.frame(var1 = runif(1000), var2 = c(runif(500), rep(NA, 500)))

myfn <- function(x, y){
    sum(x:y)
}
myfn <- Vectorize(myfn)

using_base <- function(){
    mydf$var3 <- NA
    index <- !is.na(mydf$var2)
    mydf$var3[index] <- myfn(mydf$var1[index], mydf$var2[index])
}

using_dplyr <- function(){
    mydf <- mydf %>%
        rowwise %>%
        mutate(var3 = if(is.na(var2)) NA else myfn(var1, var2))
}

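using_datatable <- function(){
    # Note: setDT() converts mydf by reference, so this likely also turns
    # the global mydf into a data.table for the later benchmark calls
    setDT(mydf)[!is.na(var2), var3 := myfn(var1, var2)]
}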

library(microbenchmark)
mbm <- microbenchmark(
    using_base(), using_dplyr(), using_datatable(), 
    times = 1000)

library(ggplot2)
autoplot(mbm)

[Figure: autoplot of the microbenchmark timings for using_base(), using_dplyr() and using_datatable()]

And as you can see, the dplyr solution using rowwise is considerably slower than its base and data.table rivals.
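For comparison, here is a sketch of a way to stay inside a dplyr chain without rowwise, by doing the same index-and-assign as the base solution inside mutate. The helper column `ok` and the use of `replace()` are my own choices for illustration, not something taken from the answers, and this variant was not part of the benchmark above, so no claim is made about its speed.

```r
library(dplyr)

set.seed(1234567)
mydf <- data.frame(var1 = runif(10), var2 = c(runif(5), rep(NA, 5)))
myfn <- Vectorize(function(x, y) sum(x:y))

out <- mydf %>%
    mutate(
        # rows where both inputs are usable (hypothetical helper column)
        ok = !is.na(var1) & !is.na(var2),
        # start from an all-NA column and fill only the usable rows;
        # as.numeric() guards against the empty list that a Vectorize()d
        # function returns when no row qualifies
        var3 = replace(rep(NA_real_, n()), ok,
                       as.numeric(myfn(var1[ok], var2[ok])))
    ) %>%
    select(-ok)
out
```

myfn only ever sees the rows where both arguments are present, so it never receives an NA, and the result keeps all 10 rows.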

asked Apr 03 '16 22:04 by elikesprogramming


2 Answers

You could perhaps consider using data.table, given that dplyr currently does not support in-place mutation, which is what you seem to be looking for.

library(data.table)
setDT(mydf)[!is.na(var2), var3 := myfn(var1, var2)]
#        var1       var2       var3
# 1: 0.56226084 0.62588794 0.56226084
# 2: 0.72649850 0.24145251 0.72649850
# 3: 0.91524985 0.03768974 0.91524985
# 4: 0.02969437 0.51659297 0.02969437
# 5: 0.76750970 0.81845788 0.76750970
# 6: 0.48005398         NA         NA
# 7: 0.08837960         NA         NA
# 8: 0.86294587         NA         NA
# 9: 0.49660306         NA         NA
#10: 0.85350403         NA         NA

answered Oct 12 '22 15:10 by mtoto


If your original function isn't vectorized and cannot cope with certain inputs, there's no performance benefit in vectorizing it using Vectorize(). Instead, use dplyr::rowwise() to operate row by row:

iris %>%
  rowwise %>%
  mutate(x = if (Sepal.Length < 5) 1 else NA) %>%
  ungroup

Note that using if here is perfectly safe, as the input has length 1.
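Applied to the data from the question, this could look like the sketch below. It calls the plain function directly under rowwise, without Vectorize(); the name `raw_fn` is a hypothetical label for the unwrapped version of the question's myfn.

```r
library(dplyr)

set.seed(1234567)
mydf <- data.frame(var1 = runif(10), var2 = c(runif(5), rep(NA, 5)))

# The question's function, not wrapped in Vectorize() (hypothetical name)
raw_fn <- function(x, y) sum(x:y)

res <- mydf %>%
    rowwise() %>%
    # each row is processed on its own, so var1 and var2 have length 1 here
    mutate(var3 = if (is.na(var2)) NA_real_ else raw_fn(var1, var2)) %>%
    ungroup()
res
```

Using NA_real_ rather than a bare NA keeps var3 numeric in every row, and ungroup() drops the row-wise grouping so later steps in the chain behave as usual.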

answered Oct 12 '22 15:10 by krlmlr