How can I find out which group failed when using group_by
in a dplyr
type chain. Take for example:
library(dplyr)
data(iris)
iris %>%
group_by(Species) %>%
do(mod=lm(Petal.Length ~ Petal.Width, data = .)) %>%
mutate(Slope = summary(mod)$coeff[2])
Works fine. Now if I add some problem data to iris
:
iris$Petal.Width[iris$Species=="versicolor"]= NA
Such that it fails when trying to run a linear model:
iris_sub <- iris[iris$Species=="versicolor",]
lm(Petal.Length ~ Petal.Width, data = iris_sub)
But if I was approaching this blind with a massive dataset if I ran this:
iris %>%
group_by(Species) %>%
do(mod=lm(Petal.Length ~ Petal.Width, data = .)) %>%
mutate(Slope = summary(mod)$coeff[2])
This error message wouldn't help me find at which level the model failed:
Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) : 0 (non-NA) cases
I could use a loop like below. This at least lets me know which level of Species
the function fails on. However, I'd much prefer to use a dplyr setup:
lmdf <- c()
for (i in unique(iris$Species)) {
cat(i, "\n")
u <- iris %>%
filter(Species==i) %>%
do(mod=lm(Petal.Length ~ Petal.Width, data = .))
lmdf = rbind(lmdf, u)
}
Any suggestions regarding a better way to achieve this? To summarise, I am trying to use a dplyr
type framework to determine at which group level, a function fails.
The tryCatch
solution referenced here no longer seems to work. I get this error:
Error in tryCatch(lm(v3 ~ v4, df), error = if (e$message == all_na_msg) default else stop(e)) : object 'e' not found
A full example of using purrr::safely
:
library(tidyverse)
data(iris)
iris$Petal.Width[iris$Species == "versicolor"] <- NA
If you're uninterested in the actual error (i.e. that the cause was 0 (non-NA) cases
), you can do:
iris %>%
group_by(Species) %>%
do(mod = safely(lm)(Petal.Length ~ Petal.Width, data = .)$result) %>%
mutate(Slope = ifelse(!is.null(mod), summary(mod)$coeff[2], NA))
And we're done!
Source: local data frame [3 x 3]
Groups: <by row>
# A tibble: 3 × 3
Species mod Slope
<fctr> <list> <dbl>
1 setosa <S3: lm> 0.5464903
2 versicolor <NULL> NA
3 virginica <S3: lm> 0.6472593
We can clearly see which group failed (because it has NULL
instead of a model, and it's Slope
is unknown). Moreover, we still obtained the correct models and slopes for the other groups, so we haven't wasted computation time (this can be quite nice when running complex models on larger datasets).
step1 <- iris %>%
group_by(Species) %>%
do(res = safely(lm)(Petal.Length ~ Petal.Width, data = .)) %>%
mutate(err = map(list(res), 'error'),
mod = map(list(res), 'result'))
Unfortunately, we have to use extra list
calls there, not entirely sure why. Alternatively, you can ungroup
first.
To see which (if any) groups have error-ed, we can use:
filter(step1, !is.null(err))
To rescue the result of the non-errored groups simply filter
first:
step1 %>%
filter(is.null(err)) %>%
mutate(Slope = summary(mod)$coeff[2])
You could also look into the broom
package if want to obtain the coefficients of models within tidy chains.
If you're not wedded to dplyr
, you could use a split-apply approach and try
in base R. Here's one way to do that:
# use split to make a list of data sets by group (here, species)
iris.split <- split(iris, iris$Species)
# iterate your modeling function over that list, using 'try' to let the
# process keep running when an error is thrown and logging an object of
#class "try-error" in that slot on the resulting list
iris.mods <- lapply(iris.split, function(i) try(lm(Petal.Length ~ Petal.Width, data = i)))
# get a vector of slopes from those models with NA where any errors killed
# the modeling process
slopes <- sapply(iris.mods, function(x) ifelse(is(x, "try-error"), NA, x$coefficients[2]))
Result:
> slopes
setosa versicolor virginica
0.5464903 NA 0.6472593
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With