<p>Using the examples from Wickhams introduction to purrr in R for data science, I am trying to create a double nested list.</p> <pre class="prettyprint"><code>library(gapminder) library(purrr) library(tidyr) gapminder nest_data <- gapminder %>% group_by(continent) %>% nest(.key = by_continent) </code></pre> <p>How can I further nest the countries so that nest_data contains by_continent and a new level of nesting by_contry that ultimately includes the tibble by_year?</p> <p>Furthermore, after creating this datastructure for the gapminder data - how would you run the regression model examples from the bookchapter for each country?</p>

<p>My solution with some explanation below.</p> <pre class="prettyprint"><code>library(gapminder) library(purrr) library(tidyr) library(broom) nest_data <- gapminder %>% group_by(continent) %>% nest(.key = by_continent) </code></pre> <h3>First question was: how to nest by_country inside the nested by_continent</h3> <h3>Great solution by @aosmith on the comments</h3> <pre class="prettyprint"><code>nested_again<- nest_data %>% mutate(by_continent = map(by_continent, ~.x %>% group_by(country) %>% nest(.key = by_country))) # Level 1 nested_again # # A tibble: 5 × 2 # continent by_continent # <fctr> <list> # 1 Asia <tibble [33 × 2]> # 2 Europe <tibble [30 × 2]> # 3 Africa <tibble [52 × 2]> # 4 Americas <tibble [25 × 2]> # 5 Oceania <tibble [2 × 2]> # Level 2 nested_again %>% unnest %>% slice(1:2) # # A tibble: 2 × 3 # continent country by_country # <fctr> <fctr> <list> # 1 Asia Afghanistan <tibble [12 × 4]> # 2 Asia Bahrain <tibble [12 × 4]> </code></pre> <h3>Second question: how to apply a regression model at a deeper level <em>(and save the models on the tibble, I suppose)</em> </h3> <h3>Solution by @aosmith (which I'm calling sol1)</h3> <pre class="prettyprint"><code>sol1<-mutate(nested_again, models = map(by_continent, "by_country") %>% at_depth(2, ~lm(lifeExp ~ year, data = .x))) sol1 # # A tibble: 5 × 3 # continent by_continent models # <fctr> <list> <list> # 1 Asia <tibble [33 × 2]> <list [33]> # 2 Europe <tibble [30 × 2]> <list [30]> # 3 Africa <tibble [52 × 2]> <list [52]> # 4 Americas <tibble [25 × 2]> <list [25]> # 5 Oceania <tibble [2 × 2]> <list [2]> sol1 %>% unnest(models) # Error: Each column must either be a list of vectors or a list of data frames [models] sol1 %>% unnest(by_continent) %>% slice(1:2) # # A tibble: 2 × 3 # continent country by_country # <fctr> <fctr> <list> # 1 Asia Afghanistan <tibble [12 × 4]> # 2 Asia Bahrain <tibble [12 × 4]> </code></pre> <p>The solution is doing what it is supposed to, but there's no easy way to filter by country, because that information is nested in the level 2.</p> <p>I propose the solution 2, based on @aosmith's solution to the first question:</p> <pre class="prettyprint"><code>sol2<-nested_again %>% mutate(by_continent = map(by_continent, ~.x %>% mutate(models = map(by_country, ~lm(lifeExp ~ year, data = .x) )) )) sol2 # # A tibble: 5 × 2 # continent by_continent # <fctr> <list> # 1 Asia <tibble [33 × 4]> # 2 Europe <tibble [30 × 4]> # 3 Africa <tibble [52 × 4]> # 4 Americas <tibble [25 × 4]> # 5 Oceania <tibble [2 × 4]> sol2 %>% unnest %>% slice(1:2) # # A tibble: 2 × 4 # continent country by_country models # <fctr> <fctr> <list> <list> # 1 Asia Afghanistan <tibble [12 × 4]> <S3: lm> # 2 Asia Bahrain <tibble [12 × 4]> <S3: lm> sol2 %>% unnest %>% unnest(by_country) %>% colnames # [1] "continent" "country" "year" "lifeExp" "pop" # [6] "gdpPercap" # get model by specific country sol2 %>% unnest %>% filter(country == "Brazil") %$% models %>% extract2(1) # Call: # lm(formula = lifeExp ~ year, data = .x) # # Coefficients: # (Intercept) year # -709.9427 0.3901 # summary with broom::tidy sol2 %>% unnest %>% filter(country == "Brazil") %$% models %>% extract2(1) %>% tidy # term estimate std.error statistic p.value # 1 (Intercept) -709.9426860 10.801042821 -65.72909 1.617791e-14 # 2 year 0.3900895 0.005456243 71.49417 6.990433e-15 </code></pre> <p>We can tidy all the models and save in the data to use for plotting or filter</p> <pre class="prettyprint"><code>sol2 %<>% mutate(by_continent = map(by_continent, ~.x %>% mutate(tidymodels = map(models, tidy )) )) sol2 %>% unnest %>% unnest(tidymodels) %>% ggplot(aes(country,p.value,colour=continent))+geom_point()+ facet_wrap(~continent)+ theme(axis.text.x = element_blank()) </code></pre> <p><img src="https://i.stack.imgur.com/BBYyY.png" alt="Linear model p-value by country"></p> <pre class="prettyprint"><code>selc <- sol2 %>% unnest %>% unnest(tidymodels) %>% filter(p.value > 0.05) %>% select(country) %>% unique %>% extract2(1) gapminder %>% filter(country %in% selc ) %>% ggplot(aes(year,lifeExp,colour=continent))+geom_line(aes(group=country))+ facet_wrap(~continent) </code></pre> <p><img src="https://i.stack.imgur.com/8OLJ8.png" alt="Filtered by p-value"></p> <p>aaaaand, we can use the models</p> <pre class="prettyprint"><code>m1 <- sol2 %>% unnest %>% slice(1) %$% models %>% extract2(1) x <- sol2 %>% unnest %>% slice(1) %>% unnest(by_country) %>% select(year) pred1 <- data.frame(year = x, lifeExp = predict.lm(m1,x)) sol2 %>% unnest %>% slice(1) %>% unnest(by_country) %>% ggplot(aes(year, lifeExp )) + geom_point() + geom_line(data=pred1) </code></pre> <p><img src="https://i.stack.imgur.com/kmiDW.png" alt="using a model to predict"></p> <p>In this case there's really no good reason to use this double nesting (besides learning how to to it, of course), but I found a case in my work where it is extremely valuable, specifically when you need a function to work on a 3rd level, grouped by levels 1 and 2, and save in level 2 - of course for this we could also use a <code>for</code> loop on level 1, but what's the fun in that ;) I'm not really sure how this "nested" <code>map</code> performs compared to <code>for</code> loop + <code>map</code>, but I'll test it next.</p> <h3>Benchmark</h3> <p>It looks like they do not differ much</p> <pre class="prettyprint"><code># comparison map_map with for_map map_map<-function(nested_again){ nested_again %>% mutate(by_continent = map(by_continent, ~.x %>% mutate(models = map(by_country, ~lm(lifeExp ~ year, data = .x) )) )) } for_map<-function(nested_again){ for(i in 1:length(nested_again[[1]])){ nested_again$by_continent[[i]] %<>% mutate(models = map(by_country, ~lm(lifeExp ~ year, data = .x) )) }} res<-microbenchmark::microbenchmark( mm<-map_map(nested_again), fm<-for_map(nested_again) ) res # Unit: milliseconds # expr min lq mean median uq max neval cld # mm <- map_map(nested_again) 121.0033 144.5530 160.6785 155.2389 174.2915 240.2012 100 a # fm <- for_map(nested_again) 131.4312 148.3329 164.7097 157.6589 173.6480 455.7862 100 a autoplot(res) </code></pre> <p><img src="https://i.stack.imgur.com/jx1Sx.png" alt="enter image description here"></p>

Double nesting in the tidyverse

Tags:

r

purrr

tidyr

Using the examples from Wickhams introduction to purrr in R for data science, I am trying to create a double nested list.

library(gapminder)
library(purrr)
library(tidyr)
gapminder
nest_data <- gapminder %>% group_by(continent) %>% nest(.key = by_continent)

How can I further nest the countries so that nest_data contains by_continent and a new level of nesting by_contry that ultimately includes the tibble by_year?

Furthermore, after creating this datastructure for the gapminder data - how would you run the regression model examples from the bookchapter for each country?

853

asked Aug 30 '16 13:08

Misha

1 Answers

My solution with some explanation below.

library(gapminder)
library(purrr)
library(tidyr)
library(broom)

nest_data <- gapminder %>% group_by(continent) %>% nest(.key = by_continent)

First question was: how to nest by_country inside the nested by_continent

Great solution by @aosmith on the comments

nested_again<-
nest_data %>% mutate(by_continent = map(by_continent, ~.x %>% 
                                          group_by(country) %>% 
                                          nest(.key = by_country)))
# Level 1
nested_again
# # A tibble: 5 × 2
# continent      by_continent
# <fctr>            <list>
#   1      Asia <tibble [33 × 2]>
#   2    Europe <tibble [30 × 2]>
#   3    Africa <tibble [52 × 2]>
#   4  Americas <tibble [25 × 2]>
#   5   Oceania  <tibble [2 × 2]>

# Level 2
nested_again %>% unnest %>% slice(1:2)
# # A tibble: 2 × 3
# continent     country        by_country
# <fctr>      <fctr>            <list>
#   1      Asia Afghanistan <tibble [12 × 4]>
#   2      Asia     Bahrain <tibble [12 × 4]>

Second question: how to apply a regression model at a deeper level (and save the models on the tibble, I suppose)

Solution by @aosmith (which I'm calling sol1)

sol1<-mutate(nested_again, models = map(by_continent, "by_country") %>%
         at_depth(2, ~lm(lifeExp ~ year, data = .x)))

sol1
# # A tibble: 5 × 3
# continent      by_continent      models
# <fctr>            <list>      <list>
#   1      Asia <tibble [33 × 2]> <list [33]>
#   2    Europe <tibble [30 × 2]> <list [30]>
#   3    Africa <tibble [52 × 2]> <list [52]>
#   4  Americas <tibble [25 × 2]> <list [25]>
#   5   Oceania  <tibble [2 × 2]>  <list [2]>

sol1 %>% unnest(models)
# Error: Each column must either be a list of vectors or a list of data frames [models]
sol1 %>% unnest(by_continent) %>% slice(1:2)
# # A tibble: 2 × 3
#   continent     country        by_country
#      <fctr>      <fctr>            <list>
# 1      Asia Afghanistan <tibble [12 × 4]>
# 2      Asia     Bahrain <tibble [12 × 4]>

The solution is doing what it is supposed to, but there's no easy way to filter by country, because that information is nested in the level 2.

I propose the solution 2, based on @aosmith's solution to the first question:

sol2<-nested_again %>% mutate(by_continent = map(by_continent, ~.x %>% 
                  mutate(models = map(by_country, ~lm(lifeExp ~ year, data = .x) )) ))
sol2
# # A tibble: 5 × 2
#   continent      by_continent
#      <fctr>            <list>
# 1      Asia <tibble [33 × 4]>
# 2    Europe <tibble [30 × 4]>
# 3    Africa <tibble [52 × 4]>
# 4  Americas <tibble [25 × 4]>
# 5   Oceania  <tibble [2 × 4]>

sol2 %>% unnest %>% slice(1:2)
# # A tibble: 2 × 4
#   continent     country        by_country   models
#      <fctr>      <fctr>            <list>   <list>
# 1      Asia Afghanistan <tibble [12 × 4]> <S3: lm>
# 2      Asia     Bahrain <tibble [12 × 4]> <S3: lm>

sol2 %>% unnest %>% unnest(by_country) %>% colnames
# [1] "continent" "country"   "year"      "lifeExp"   "pop"      
# [6] "gdpPercap"

# get model by specific country
sol2 %>% unnest %>% filter(country == "Brazil") %$% models %>% extract2(1)
# Call:
#   lm(formula = lifeExp ~ year, data = .x)
# 
# Coefficients:
#   (Intercept)         year  
# -709.9427       0.3901

# summary with broom::tidy
sol2 %>% unnest %>% filter(country == "Brazil") %$% models %>%
  extract2(1) %>% tidy
#          term     estimate    std.error statistic      p.value
# 1 (Intercept) -709.9426860 10.801042821 -65.72909 1.617791e-14
# 2        year    0.3900895  0.005456243  71.49417 6.990433e-15

We can tidy all the models and save in the data to use for plotting or filter

sol2 %<>% mutate(by_continent = map(by_continent, ~.x %>% 
        mutate(tidymodels = map(models, tidy )) ))

sol2 %>% unnest %>% unnest(tidymodels) %>% 
  ggplot(aes(country,p.value,colour=continent))+geom_point()+
  facet_wrap(~continent)+
  theme(axis.text.x = element_blank())

Linear model p-value by country

selc <- sol2 %>% unnest %>% unnest(tidymodels) %>% filter(p.value > 0.05) %>% 
  select(country) %>% unique %>% extract2(1)

gapminder %>% filter(country %in% selc ) %>%
  ggplot(aes(year,lifeExp,colour=continent))+geom_line(aes(group=country))+
  facet_wrap(~continent)

Filtered by p-value

aaaaand, we can use the models

m1 <- sol2 %>% unnest %>% slice(1) %$% models %>% extract2(1)

x <- sol2 %>% unnest %>% slice(1) %>% unnest(by_country) %>% select(year)

pred1 <- data.frame(year = x, lifeExp = predict.lm(m1,x))

sol2 %>% unnest %>% slice(1) %>% unnest(by_country) %>%
  ggplot(aes(year, lifeExp )) + geom_point() +
  geom_line(data=pred1)

using a model to predict

In this case there's really no good reason to use this double nesting (besides learning how to to it, of course), but I found a case in my work where it is extremely valuable, specifically when you need a function to work on a 3rd level, grouped by levels 1 and 2, and save in level 2 - of course for this we could also use a for loop on level 1, but what's the fun in that ;) I'm not really sure how this "nested" map performs compared to for loop + map, but I'll test it next.

Benchmark

It looks like they do not differ much

# comparison map_map with for_map
map_map<-function(nested_again){
nested_again %>% mutate(by_continent = map(by_continent, ~.x %>% 
  mutate(models = map(by_country, ~lm(lifeExp ~ year, data = .x) )) )) }

for_map<-function(nested_again){ for(i in 1:length(nested_again[[1]])){
  nested_again$by_continent[[i]] %<>%
  mutate(models = map(by_country, ~lm(lifeExp ~ year, data = .x) )) }}

res<-microbenchmark::microbenchmark(
  mm<-map_map(nested_again), fm<-for_map(nested_again) )

res
# Unit: milliseconds
#                         expr      min       lq     mean   median       uq      max neval cld
#  mm <- map_map(nested_again) 121.0033 144.5530 160.6785 155.2389 174.2915 240.2012   100   a
#  fm <- for_map(nested_again) 131.4312 148.3329 164.7097 157.6589 173.6480 455.7862   100   a

autoplot(res)

enter image description here

157

answered Sep 28 '22 05:09

RBA

Related questions
                            
                                Using the geosphere distm function on a data.table to calculate distances
                            
                                If a text string contains something then return something in R
                            
                                Calculating BIC manually for lm object
                            
                                R Split string and keep substrings righthand of match?
                            
                                Update data in shiny application without refreshing the whole application
                            
                                calculating percentiles for multiple columns in R
                            
                                R / ggplot2: Evaluate object inside expression
                            
                                Package "rvest" for web scraping https site with proxy
                            
                                dplyr suppress next n occurrences of a value in a group
                            
                                Converting R command line arguments into integer vectors
                            
                                Select list element programmatically using name stored as string [duplicate]
                            
                                Add vertical line in a DT datatable in a shiny app with custom container
                            
                                Combining pipes and the magrittr dot (.) placeholder
                            
                                How to add my Outlook email signature to the COM object using RDCOMClient
                            
                                ggplot2: Show difference in values over time with an arrow
                            
                                Order of operation with piping
                            
                                filter and unfilter in dplyr
                            
                                R data.table dot product with matching column names (for each group)
                            
                                shinydashboard does not work with uiOutput
                            
                                setnames in pipeline R code

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With