This comes up a lot when dealing with API's.
Most of the time, to do real analysis, I'd like to get my dataset tidy, but typically, this requires a solution for each type of tree, rather than something more general.
I figured it would be nice to have one function that generates tidy data (albeit with a ton of NA's in deeply nested trees with many different factor levels.
I have a hackish solution which follows, using unlist(..., recursive = FALSE) + a naming convention,
But I'd like to see if someone here might have a better solution to tidy these kinds of list structures.
#####################
# Some Test Data
aNestedTree =
list(a = 1,
b = 2,
c = list(
a = list(1:5),
b = 2,
c = list(
a = 1,
d = 3,
e = list())),
d = list(
y = 3,
z = 2
))
############################################################
# Run through the list and rename all list elements,
# We unlist once at time, adding "__" at each unlist step
# until the object is no longer a list
renameVars <- function(lst, sep = '__') {
if(is.list(lst)) {
names(lst) <- paste0(names(lst),sep)
renameVars(unlist(lst, recursive = FALSE),sep = sep)
} else {
lst
}
}
res <- renameVars(aNestedTree)
We can check the output and see that we have a strangely named object, But there's a method to this madness.
> res
a________ b________ c__.a____1__ c__.a____2__ c__.a____3__
1 2 1 2 3
c__.a____4__ c__.a____5__ c__.b______ c__.c__.a____ c__.c__.d____
4 5 2 1 3
d__.y______ d__.z______
3 2
Now I put this in a data.table, so I can shape it.
library(data.table)
dt <- data.table(values = res, name = names(res))
# Use some regex to split that name up, along with data.table's tstrsplit
# function to separate them into as many columns as there are nests
> dt[,paste0('V',seq_along(s <- tstrsplit(dt$name,'[__]+(\\.|)'))) := s]
> dt
values name V1 V2 V3
1: 1 a________ a NA NA
2: 2 b________ b NA NA
3: 1 c__.a____1__ c a 1
4: 2 c__.a____2__ c a 2
5: 3 c__.a____3__ c a 3
6: 4 c__.a____4__ c a 4
7: 5 c__.a____5__ c a 5
8: 2 c__.b______ c b NA
9: 1 c__.c__.a____ c c a
10: 3 c__.c__.d____ c c d
11: 3 d__.y______ d y NA
12: 2 d__.z______ d z NA
I can then filter for the factor combinations that I want (Or dcast/spread). (Though I'm effectively breaking apart tables at the lowest level if they exist)
I thought about going through bind.c and pulling out the do_unlistto make a function with a flexible naming convention via Rcpp, but my C++ is rusty, so I figured I'd post here before I do anything drastic.
I tend to lean towards tidyjson as well. In the tidyverse, the behavior you are looking for seems to be in the gather family.
I think the gather family of functions in tidyjson could do with a bit of improvement that would make these helpers unnecessary. Right now, they are very "type-sensitive" and error or throw out types that do not match. In any case, the workaround is not too challenging, although it definitely lacks elegance. Note that the bind_rows variant is presently from my development version and is not mainstream yet. Hopefully this illustrates the idea, though.
Notes on approach:
bind_rows stacks the datasets together.First define the helpers:
recurse_gather <- function(.x,.level) {
.x <- tidyjson::bind_rows(
gobj(.x,.level)
, garr(.x,.level)
, gpersist(.x,.level)
)
if (any(as.character(json_types(.x,'type')$type) %in% c('object','array'))) {
.x <- recurse_gather(.x,.level+1)
}
return(.x)
}
gobj <- function(.x,.level) {
.x %>% json_types('type') %>%
filter(type=='object') %>%
gather_object(paste0('v',.level)) %>%
select(-type)
}
gpersist <- function(.x,.level) {
.x %>% json_types('type') %>%
filter(! type %in% c('object','array')) %>%
mutate_(.dots=setNames(
paste0('as.character(NA)')
,paste0('v',.level)
)) %>%
select(-type)
}
garr <- function(.x,.level) {
.x %>% json_types('type') %>%
filter(type=='array') %>%
gather_array('arridx') %>%
append_values_number(paste0('v',.level)) %>%
mutate_(.dots=setNames(
paste0('as.character(v',.level,')')
,paste0('v',.level)
)) %>%
select(-arridx,-type)
}
Then using the helpers is pretty straight-forward.
library(dplyr)
library(tidyjson)
j <- "{\"a\":[1],\"b\":[2],\"c\":{\"a\":[1,2,3,4,5],\"b\":[2],\"c\":{\"a\":[1],\"d\":[3],\"e\":[]}},\"d\":{\"y\":[3],\"z\":[2]}}"
recurse_gather(j, 1) %>% arrange(v1, v2, v3, v4) %>% tbl_df()
#> # A tibble: 12 x 5
#> document.id v1 v2 v3 v4
#> * <int> <chr> <chr> <chr> <chr>
#> 1 1 a 1 <NA> <NA>
#> 2 1 b 2 <NA> <NA>
#> 3 1 c a 1 <NA>
#> 4 1 c a 2 <NA>
#> 5 1 c a 3 <NA>
#> 6 1 c a 4 <NA>
#> 7 1 c a 5 <NA>
#> 8 1 c b 2 <NA>
#> 9 1 c c a 1
#> 10 1 c c d 3
#> 11 1 d y 3 <NA>
#> 12 1 d z 2 <NA>
Hopeful that future development on the tidyjson package will make this an easier problem to tackle!
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With