
Extracting data from irregular lists using purrr::map()

Given a list with several elements, the goal is to get them into a data frame. The map_dfr function from the purrr package is highly useful with regular lists, but gives an error with irregular lists.

For instance, this example from a tutorial works:

library(purrr)
library(repurrrsive) # The data comes from this package


map_dfr(got_chars, magrittr::extract, c("name", "culture", "gender", "id", "born", "alive"))

# A tibble: 30 x 6
   name               culture  gender    id born                                   alive
   <chr>              <chr>    <chr>  <int> <chr>                                  <lgl>
 1 Theon Greyjoy      Ironborn Male    1022 In 278 AC or 279 AC, at Pyke           TRUE 
 2 Tyrion Lannister   ""       Male    1052 In 273 AC, at Casterly Rock            TRUE 
 3 Victarion Greyjoy  Ironborn Male    1074 In 268 AC or before, at Pyke           TRUE 
 4 Will               ""       Male    1109 ""                                     FALSE
 5 Areo Hotah         Norvoshi Male    1166 In 257 AC or before, at Norvos         TRUE 
 6 Chett              ""       Male    1267 At Hag's Mire                          FALSE
 7 Cressen            ""       Male    1295 In 219 AC or 220 AC                    FALSE
 8 Arianne Martell    Dornish  Female   130 In 276 AC, at Sunspear                 TRUE 
 9 Daenerys Targaryen Valyrian Female  1303 In 284 AC, at Dragonstone              TRUE 
10 Davos Seaworth     Westeros Male    1319 In 260 AC or before, at King's Landing TRUE 
# … with 20 more rows

However, if a field is removed from one of the list's elements, the function fails.

got_chars[[1]]["gender"]<-NULL
map_dfr(got_chars, magrittr::extract, c("name", "culture", "gender", "id", "born", "alive"))

#Error: Argument 3 is a list, must contain atomic vectors

The desired output would be an NA value for the missing element. What would an elegant solution be? I suspect the solution includes using purrr::possibly(), but I haven't figured it out yet.

Boaz Sobrado asked Dec 13 '22 10:12


2 Answers

The devel version of tidyr has powerful new "unnesting" functions and they can handle this problematic data (Option 1). Another approach to this is to attack the problem column-wise, which lets you use the .default argument to purrr::map(), which provides a value to use for missing elements (Option 2).

library(tidyverse)   # purrr, tidyr, and dplyr
library(repurrrsive) # The data comes from this package

got_chars_mutilated <- got_chars
got_chars_mutilated[[1]]["gender"] <- NULL

# original problem
map_dfr(
  got_chars_mutilated,
  magrittr::extract,
  c("name", "culture", "gender", "id", "born", "alive")
)
#> Error: Argument 3 is a list, must contain atomic vectors

# Option 1:
# expanded unnest_*() functions coming soon in tidyr
packageVersion("tidyr")
#> [1] '0.8.99.9000'

# automatic unnesting leads to ... unnest_wider()
tibble(got = got_chars_mutilated) %>% 
  unnest_auto(got)
#> Using `unnest_wider(got)`; elements have {n_common} names in common
#> # A tibble: 30 x 18
#>    url      id name  culture born  died  alive titles aliases father mother
#>    <chr> <int> <chr> <chr>   <chr> <chr> <lgl> <list> <list>  <chr>  <chr> 
#>  1 http…  1022 Theo… Ironbo… In 2… ""    TRUE  <chr … <chr [… ""     ""    
#>  2 http…  1052 Tyri… ""      In 2… ""    TRUE  <chr … <chr [… ""     ""    
#>  3 http…  1074 Vict… Ironbo… In 2… ""    TRUE  <chr … <chr [… ""     ""    
#>  4 http…  1109 Will  ""      ""    In 2… FALSE <chr … <chr [… ""     ""    
#>  5 http…  1166 Areo… Norvos… In 2… ""    TRUE  <chr … <chr [… ""     ""    
#>  6 http…  1267 Chett ""      At H… In 2… FALSE <chr … <chr [… ""     ""    
#>  7 http…  1295 Cres… ""      In 2… In 2… FALSE <chr … <chr [… ""     ""    
#>  8 http…   130 Aria… Dornish In 2… ""    TRUE  <chr … <chr [… ""     ""    
#>  9 http…  1303 Daen… Valyri… In 2… ""    TRUE  <chr … <chr [… ""     ""    
#> 10 http…  1319 Davo… Wester… In 2… ""    TRUE  <chr … <chr [… ""     ""    
#> # … with 20 more rows, and 7 more variables: spouse <chr>,
#> #   allegiances <list>, books <list>, povBooks <list>, tvSeries <list>,
#> #   playedBy <list>, gender <chr>

# let's do it again, calling the proper function, and inspect `gender`
tibble(got = got_chars_mutilated) %>% 
  unnest_wider(got) %>% 
  pull(gender)
#>  [1] NA       "Male"   "Male"   "Male"   "Male"   "Male"   "Male"  
#>  [8] "Female" "Female" "Male"   "Female" "Male"   "Female" "Male"  
#> [15] "Male"   "Male"   "Female" "Female" "Female" "Male"   "Male"  
#> [22] "Male"   "Male"   "Male"   "Male"   "Female" "Male"   "Male"  
#> [29] "Male"   "Female"

# Option 2:
# attack this column-wise
# mapping the names gives access to the `.default` argument for missing elements
c("name", "culture", "gender", "id", "born", "alive") %>% 
  set_names() %>% 
  map(~ map(got_chars_mutilated, .x, .default = NA)) %>%
  map(simplify) %>% 
  as_tibble()
#> # A tibble: 30 x 6
#>    name           culture  gender      id born                        alive
#>    <chr>          <chr>    <list>   <int> <chr>                       <lgl>
#>  1 Theon Greyjoy  Ironborn <lgl [1…  1022 In 278 AC or 279 AC, at Py… TRUE 
#>  2 Tyrion Lannis… ""       <chr [1…  1052 In 273 AC, at Casterly Rock TRUE 
#>  3 Victarion Gre… Ironborn <chr [1…  1074 In 268 AC or before, at Py… TRUE 
#>  4 Will           ""       <chr [1…  1109 ""                          FALSE
#>  5 Areo Hotah     Norvoshi <chr [1…  1166 In 257 AC or before, at No… TRUE 
#>  6 Chett          ""       <chr [1…  1267 At Hag's Mire               FALSE
#>  7 Cressen        ""       <chr [1…  1295 In 219 AC or 220 AC         FALSE
#>  8 Arianne Marte… Dornish  <chr [1…   130 In 276 AC, at Sunspear      TRUE 
#>  9 Daenerys Targ… Valyrian <chr [1…  1303 In 284 AC, at Dragonstone   TRUE 
#> 10 Davos Seaworth Westeros <chr [1…  1319 In 260 AC or before, at Ki… TRUE 
#> # … with 20 more rows

Created on 2019-08-15 by the reprex package (v0.3.0.9000)
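
A note on the list column for gender in Option 2: it most likely arises because the default NA is logical while the remaining values are character, so that column cannot be simplified to an atomic vector. One possible way around this, assuming you know each column's type up front, is to use the typed mappers map_chr() and map_int() with a typed default. The sketch below uses a small hypothetical list, chars, as a stand-in for got_chars_mutilated, so it runs without repurrrsive.

```r
library(purrr)

# Hypothetical stand-in for got_chars_mutilated: the first record lacks `gender`
chars <- list(
  list(name = "Theon Greyjoy", id = 1022L),
  list(name = "Tyrion Lannister", gender = "Male", id = 1052L),
  list(name = "Arianne Martell", gender = "Female", id = 130L)
)

# A typed mapper plus a typed default keeps every column atomic
out <- data.frame(
  name   = map_chr(chars, "name",   .default = NA_character_),
  gender = map_chr(chars, "gender", .default = NA_character_),
  id     = map_int(chars, "id",     .default = NA_integer_),
  stringsAsFactors = FALSE
)

out$gender
```

The trade-off is that each column is spelled out by hand, which is more verbose than mapping over a vector of names but makes the intended type of every column explicit.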

jennybryan answered Dec 22 '22 00:12


One way is to define a partial()ly-specified pluck() that extracts a name of interest, returning NA if it's missing. Pass the modified pluck() to a double-map, with the inner map traversing the names to extract and the outer map traversing your got_chars list:

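library(purrr)

# got_chars here is the modified list from the question (gender removed from element 1)
v <- set_names(c("name", "culture", "gender", "id", "born", "alive"))
map_dfr(got_chars, ~ map(v, partial(pluck, .x, .default = NA)))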
# # A tibble: 30 x 6
#    name             culture  gender    id born                             alive
#    <chr>            <chr>    <chr>  <int> <chr>                            <lgl>
#  1 Theon Greyjoy    Ironborn NA      1022 In 278 AC or 279 AC, at Pyke     TRUE 
#  2 Tyrion Lannister ""       Male    1052 In 273 AC, at Casterly Rock      TRUE 
#  3 Victarion Greyj… Ironborn Male    1074 In 268 AC or before, at Pyke     TRUE 
#  4 Will             ""       Male    1109 ""                               FALSE
#  5 Areo Hotah       Norvoshi Male    1166 In 257 AC or before, at Norvos   TRUE 
#  6 Chett            ""       Male    1267 At Hag's Mire                    FALSE
#  7 Cressen          ""       Male    1295 In 219 AC or 220 AC              FALSE
#  8 Arianne Martell  Dornish  Female   130 In 276 AC, at Sunspear           TRUE 
#  9 Daenerys Targar… Valyrian Female  1303 In 284 AC, at Dragonstone        TRUE 
# 10 Davos Seaworth   Westeros Male    1319 In 260 AC or before, at King's … TRUE 
# # … with 20 more rows

To clarify, .x iterates over got_chars because it lives inside a lambda function specified with ~, so it corresponds to the outer map. The function for the inner map is specified with partial(), which attaches the currently looked-at got_chars element (i.e., the .x) as the first argument to pluck(). The modified pluck() then accepts the name to extract as its (new) first argument, so it can be passed to the inner map as-is, without any extra ~ needed.
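
If the partial() form is hard to read, the same double-map can be written with explicit anonymous functions. This is only a sketch of the equivalent logic: chars is a small hypothetical stand-in for the modified got_chars, and rec and field are illustrative argument names. It stops at the nested list of rows, which is what map_dfr() would then bind into a data frame.

```r
library(purrr)

# Hypothetical stand-in for the modified got_chars: record 1 lacks `gender`
chars <- list(
  list(name = "Theon Greyjoy", id = 1022L),
  list(name = "Tyrion Lannister", gender = "Male", id = 1052L)
)

# Naming the vector by itself makes the inner map return a named list
v <- set_names(c("name", "gender", "id"))

# Outer map: one record at a time (the `.x` of the lambda in the answer)
# Inner map: one field name at a time, with NA for missing fields
rows <- map(chars, function(rec) {
  map(v, function(field) pluck(rec, field, .default = NA))
})

rows[[1]]$gender
```

Here pluck(rec, field, .default = NA) plays the role of the partially applied pluck(): the record is fixed by the outer function and only the field name varies in the inner one.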

Artem Sokolov answered Dec 21 '22 23:12