Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Tracking which group fails in a dplyr chain

Tags:

r

dplyr

How can I find out which group failed when using group_by in a dplyr type chain. Take for example:

library(dplyr)

data(iris)

iris %>%
  group_by(Species) %>%
  do(mod=lm(Petal.Length ~ Petal.Width, data = .)) %>%
  mutate(Slope = summary(mod)$coeff[2]) 

Works fine. Now if I add some problem data to iris:

iris$Petal.Width[iris$Species=="versicolor"]= NA

Such that it fails when trying to run a linear model:

iris_sub <- iris[iris$Species=="versicolor",]
lm(Petal.Length ~ Petal.Width, data = iris_sub)

But if I was approaching this blind with a massive dataset if I ran this:

iris %>%
  group_by(Species) %>%
  do(mod=lm(Petal.Length ~ Petal.Width, data = .)) %>%
  mutate(Slope = summary(mod)$coeff[2]) 

This error message wouldn't help me find at which level the model failed:

Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) : 0 (non-NA) cases

I could use a loop like below. This at least lets me know which level of Species the function fails on. However, I'd much prefer to use a dplyr setup:

lmdf <- c()
for (i in unique(iris$Species)) {
  cat(i, "\n")
  u <- iris %>%
    filter(Species==i) %>%
    do(mod=lm(Petal.Length ~ Petal.Width, data = .))
  lmdf = rbind(lmdf, u)
}

Any suggestions regarding a better way to achieve this? To summarise, I am trying to use a dplyr type framework to determine at which group level, a function fails.

The tryCatch solution referenced here no longer seems to work. I get this error:

Error in tryCatch(lm(v3 ~ v4, df), error = if (e$message == all_na_msg) default else stop(e)) : object 'e' not found

like image 494
boshek Avatar asked Feb 16 '17 18:02

boshek


2 Answers

A full example of using purrr::safely:

Prep

library(tidyverse)
data(iris)
iris$Petal.Width[iris$Species == "versicolor"] <-  NA

Safely run the models

If you're uninterested in the actual error (i.e. that the cause was 0 (non-NA) cases), you can do:

iris %>%
  group_by(Species) %>%
  do(mod = safely(lm)(Petal.Length ~ Petal.Width, data = .)$result) %>% 
  mutate(Slope = ifelse(!is.null(mod), summary(mod)$coeff[2], NA))

And we're done!

Source: local data frame [3 x 3]
Groups: <by row>

# A tibble: 3 × 3
     Species      mod     Slope
      <fctr>   <list>     <dbl>
1     setosa <S3: lm> 0.5464903
2 versicolor   <NULL>        NA
3  virginica <S3: lm> 0.6472593

We can clearly see which group failed (because it has NULL instead of a model, and it's Slope is unknown). Moreover, we still obtained the correct models and slopes for the other groups, so we haven't wasted computation time (this can be quite nice when running complex models on larger datasets).

Keeping track of both the models and the errors

step1 <- iris %>%
  group_by(Species) %>%
  do(res = safely(lm)(Petal.Length ~ Petal.Width, data = .)) %>%
  mutate(err = map(list(res), 'error'),
         mod = map(list(res), 'result'))

Unfortunately, we have to use extra list calls there, not entirely sure why. Alternatively, you can ungroup first.

To see which (if any) groups have error-ed, we can use:

filter(step1, !is.null(err))

To rescue the result of the non-errored groups simply filter first:

step1 %>% 
  filter(is.null(err)) %>% 
  mutate(Slope = summary(mod)$coeff[2])

You could also look into the broom package if want to obtain the coefficients of models within tidy chains.

like image 140
Axeman Avatar answered Oct 23 '22 03:10

Axeman


If you're not wedded to dplyr, you could use a split-apply approach and try in base R. Here's one way to do that:

# use split to make a list of data sets by group (here, species)
iris.split <- split(iris, iris$Species)

# iterate your modeling function over that list, using 'try' to let the
# process keep running when an error is thrown and logging an object of
#class "try-error" in that slot on the resulting list
iris.mods <- lapply(iris.split, function(i) try(lm(Petal.Length ~ Petal.Width, data = i)))

# get a vector of slopes from those models with NA where any errors killed
# the modeling process
slopes <- sapply(iris.mods, function(x) ifelse(is(x, "try-error"), NA, x$coefficients[2]))

Result:

> slopes
    setosa versicolor  virginica 
 0.5464903         NA  0.6472593
like image 45
ulfelder Avatar answered Oct 23 '22 02:10

ulfelder