Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fit a different model for each row of a list-columns data frame

Tags:

What is the best way to fit different model formulae that vary by the row of a data frame with the list-columns data structure in tidyverse?

In R for Data Science, Hadley presents a terrific example of how to use the list-columns data structure and fit many models easily (http://r4ds.had.co.nz/many-models.html#gapminder). I am trying to find a way to fit many models with slightly different formulae. In the below example adapted from his original example, what is the best way to fit a different model for each continent?

library(gapminder)
library(dplyr)
library(tidyr)
library(purrr)
library(broom)

by_continent <- gapminder %>% 
  group_by(continent) %>% 
  nest()

by_continent <- by_continent %>% 
  mutate(model = map(data, ~lm(lifeExp ~ year, data = .)))

by_continent %>% 
  mutate(glance=map(model, glance)) %>% 
  unnest(glance, .drop=T)

## A tibble: 5 × 12
#  continent r.squared adj.r.squared     sigma statistic      p.value    df
#     <fctr>     <dbl>         <dbl>     <dbl>     <dbl>        <dbl> <int>
#1      Asia 0.4356350     0.4342026 8.9244419  304.1298 6.922751e-51     2
#2    Europe 0.4984659     0.4970649 3.8530964  355.8099 1.344184e-55     2
#3    Africa 0.2987543     0.2976269 7.6685811  264.9929 6.780085e-50     2
#4  Americas 0.4626467     0.4608435 6.8618439  256.5699 4.354220e-42     2
#5   Oceania 0.9540678     0.9519800 0.8317499  456.9671 3.299327e-16     2
## ... with 5 more variables: logLik <dbl>, AIC <dbl>, BIC <dbl>,
##   deviance <dbl>, df.residual <int>

I know I can do it by iterating through by_continent (not efficient as it estimates each model for every continent:

formulae <- list(
  Asia=~lm(lifeExp ~ year, data = .),
  Europe=~lm(lifeExp ~ year + pop, data = .),
  Africa=~lm(lifeExp ~ year + gdpPercap, data = .),
  Americas=~lm(lifeExp ~ year - 1, data = .),
  Oceania=~lm(lifeExp ~ year + pop + gdpPercap, data = .)
)

for (i in 1:nrow(by_continent)) {
  by_continent$model[[i]] <- map(by_continent$data, formulae[[i]])[[i]]
}

by_continent %>% 
  mutate(glance=map(model, glance)) %>% 
  unnest(glance, .drop=T)

## A tibble: 5 × 12
#  continent r.squared adj.r.squared     sigma  statistic       p.value    df
#     <fctr>     <dbl>         <dbl>     <dbl>      <dbl>         <dbl> <int>
#1      Asia 0.4356350     0.4342026 8.9244419   304.1298  6.922751e-51     2
#2    Europe 0.4984677     0.4956580 3.8584819   177.4093  3.186760e-54     3
#3    Africa 0.4160797     0.4141991 7.0033542   221.2506  2.836552e-73     3
#4  Americas 0.9812082     0.9811453 8.9703814 15612.1901 4.227928e-260     1
#5   Oceania 0.9733268     0.9693258 0.6647653   243.2719  6.662577e-16     4
## ... with 5 more variables: logLik <dbl>, AIC <dbl>, BIC <dbl>,
##   deviance <dbl>, df.residual <int>

But is it possible to do this without following back to loop in base R (and avoiding fitting models I don't need)?

What I tried is something like this:

by_continent <- by_continent %>% 
left_join(tibble::enframe(formulae, name="continent", value="formula"))

by_continent %>% 
   mutate(model=map2(data, formula, est_model))

But I don't seem to be able to come up with an est_model function that works. I tried this function (h/t: https://gist.github.com/multidis/8138757) that doesn't work:

  est_model <- function(data, formula, ...) {
  mc <- match.call()
  m <- match(c("formula","data"), names(mc), 0L)
  mf <- mc[c(1L, m)]
  mf[[1L]] <- as.name("model.frame")
  mf <- eval(mf, parent.frame())
  data.st <- data.frame(mf)

  return(data.st)
}

(Admittedly, this is a contrived example. My actual case is that I have substantial observations missing key independent variables in my data, so I want to fit one model with all variables on complete observations and another with only a subset of the variables on the rest observations.)

UPDATE

I came up with an est_model function that works (though probably not efficient):

est_model <- function(data, formula, ...) {
  map(list(data), formula, ...)[[1]]
}

by_continent <- by_continent %>% 
   mutate(model=map2(data, formula, est_model))

by_continent %>% 
  mutate(glance=map(model, glance)) %>% 
  unnest(glance, .drop=T)

## A tibble: 5 × 12
#  continent r.squared adj.r.squared     sigma  statistic       p.value    df
#      <chr>     <dbl>         <dbl>     <dbl>      <dbl>         <dbl> <int>
#1      Asia 0.4356350     0.4342026 8.9244419   304.1298  6.922751e-51     2
#2    Europe 0.4984677     0.4956580 3.8584819   177.4093  3.186760e-54     3
#3    Africa 0.4160797     0.4141991 7.0033542   221.2506  2.836552e-73     3
#4  Americas 0.9812082     0.9811453 8.9703814 15612.1901 4.227928e-260     1
#5   Oceania 0.9733268     0.9693258 0.6647653   243.2719  6.662577e-16     4
## ... with 5 more variables: logLik <dbl>, AIC <dbl>, BIC <dbl>, deviance <dbl>,
##   df.residual <int>
like image 799
LmW. Avatar asked Dec 30 '16 23:12

LmW.


People also ask

What does rowwise () do in R?

rowwise() allows you to compute on a data frame a row-at-a-time. This is most useful when a vectorised function doesn't exist. Most dplyr verbs preserve row-wise grouping. The exception is summarise() , which return a grouped_df.

How do you specify rows and columns in R?

To select a specific column, you can also type in the name of the dataframe, followed by a $ , and then the name of the column you are looking to select. In this example, we will be selecting the payment column of the dataframe. When running this script, R will simplify the result as a vector.

How do I create a Dataframe from two lists in R?

The expand. grid function create a data frame from all combinations of the provided lists or vectors or factors. For example, if we have two lists defined as List1 and List2 then we can create a data frame using the code expand. grid(List1,List2).

How do you Unnest in R?

The tidyr package in R is used to “tidy” up the data. The unnest() method in the package can be used to convert the data frame into an unnested object by specifying the input data and its corresponding columns to use in unnesting. The output is produced in the form of a tibble in R.


2 Answers

I find it is easier to make a list of model formula. each model was only fit once for the corresponding continent. I add a new column formula to the nested data to make sure that the formula and the continent are in the same order in case they are not.

formulae <- c(
    Asia= lifeExp ~ year,
    Europe= lifeExp ~ year + pop,
    Africa= lifeExp ~ year + gdpPercap,
    Americas= lifeExp ~ year - 1,
    Oceania= lifeExp ~ year + pop + gdpPercap
)

df <- gapminder %>%
    group_by(continent) %>%
    nest() %>%
    mutate(formula = formulae[as.character(continent)]) %>%
    mutate(model = map2(formula, data, ~ lm(.x, .y))) %>%
    mutate(glance=map(model, glance)) %>%
    unnest(glance, .drop=T)

# # A tibble: 5 × 12
#   continent r.squared adj.r.squared     sigma  statistic       p.value    df      logLik        AIC        BIC
#      <fctr>     <dbl>         <dbl>     <dbl>      <dbl>         <dbl> <int>       <dbl>      <dbl>      <dbl>
# 1      Asia 0.4356350     0.4342026 8.9244419   304.1298  6.922751e-51     2 -1427.65947 2861.31893 2873.26317
# 2    Europe 0.4984677     0.4956580 3.8584819   177.4093  3.186760e-54     3  -995.41016 1998.82033 2014.36475
# 3    Africa 0.4160797     0.4141991 7.0033542   221.2506  2.836552e-73     3 -2098.46089 4204.92179 4222.66639
# 4  Americas 0.9812082     0.9811453 8.9703814 15612.1901 4.227928e-260     1 -1083.35918 2170.71836 2178.12593
# 5   Oceania 0.9733268     0.9693258 0.6647653   243.2719  6.662577e-16     4   -22.06696   54.13392   60.02419
# # ... with 2 more variables: deviance <dbl>, df.residual <int>
like image 70
mt1022 Avatar answered Sep 21 '22 10:09

mt1022


I found purrr::modify_depth() that does what I want to do with est_model() in the original question. This is the solution I am now happy with:

library(gapminder)
library(tidyverse)
library(purrr)
library(broom)

fmlas <- tibble::tribble(
  ~continent, ~formula,
  "Asia", ~lm(lifeExp ~ year, data = .),
  "Europe", ~lm(lifeExp ~ year + pop, data = .),
  "Africa", ~lm(lifeExp ~ year + gdpPercap, data = .),
  "Americas", ~lm(lifeExp ~ year - 1, data = .),
  "Oceania", ~lm(lifeExp ~ year + pop + gdpPercap, data = .)
)

by_continent <- gapminder %>% 
  nest(-continent) %>%
  left_join(fmlas) %>%
  mutate(model=map2(data, formula, ~modify_depth(.x, 0, .y)))

by_continent %>% 
  mutate(glance=map(model, glance)) %>% 
  unnest(glance, .drop=T)
like image 37
LmW. Avatar answered Sep 24 '22 10:09

LmW.