Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Tidy nested json tree

Tags:

json

r

This comes up a lot when dealing with API's.

Most of the time, to do real analysis, I'd like to get my dataset tidy, but typically, this requires a solution for each type of tree, rather than something more general.

I figured it would be nice to have one function that generates tidy data (albeit with a ton of NA's in deeply nested trees with many different factor levels.

I have a hackish solution which follows, using unlist(..., recursive = FALSE) + a naming convention,

But I'd like to see if someone here might have a better solution to tidy these kinds of list structures.

#####################
# Some Test Data
aNestedTree = 
  list(a = 1, 
       b = 2, 
       c = list(
         a = list(1:5), 
         b = 2, 
         c = list(
           a = 1, 
           d = 3,
           e = list())),
       d = list(
         y = 3,
         z = 2
       ))

############################################################
# Run through the list and rename all list elements,
# We unlist once at  time, adding "__" at each unlist step
# until the object is no longer a list

renameVars <- function(lst, sep = '__') {
  if(is.list(lst)) {
    names(lst) <- paste0(names(lst),sep)
    renameVars(unlist(lst, recursive = FALSE),sep = sep)
  } else {
    lst
  }
}

res <- renameVars(aNestedTree)

We can check the output and see that we have a strangely named object, But there's a method to this madness.

> res
    a________     b________  c__.a____1__  c__.a____2__  c__.a____3__ 
            1             2             1             2             3 
 c__.a____4__  c__.a____5__   c__.b______ c__.c__.a____ c__.c__.d____ 
            4             5             2             1             3 
  d__.y______   d__.z______ 
            3             2 

Now I put this in a data.table, so I can shape it.

library(data.table)
dt <- data.table(values = res, name = names(res))

# Use some regex to split that name up, along with data.table's tstrsplit
# function to separate them into as many columns as there are nests

> dt[,paste0('V',seq_along(s <- tstrsplit(dt$name,'[__]+(\\.|)'))) := s]
> dt
    values          name V1 V2 V3
 1:      1     a________  a NA NA
 2:      2     b________  b NA NA
 3:      1  c__.a____1__  c  a  1
 4:      2  c__.a____2__  c  a  2
 5:      3  c__.a____3__  c  a  3
 6:      4  c__.a____4__  c  a  4
 7:      5  c__.a____5__  c  a  5
 8:      2   c__.b______  c  b NA
 9:      1 c__.c__.a____  c  c  a
10:      3 c__.c__.d____  c  c  d
11:      3   d__.y______  d  y NA
12:      2   d__.z______  d  z NA

I can then filter for the factor combinations that I want (Or dcast/spread). (Though I'm effectively breaking apart tables at the lowest level if they exist)

I thought about going through bind.c and pulling out the do_unlistto make a function with a flexible naming convention via Rcpp, but my C++ is rusty, so I figured I'd post here before I do anything drastic.

like image 571
Shape Avatar asked Jun 17 '26 19:06

Shape


1 Answers

I tend to lean towards tidyjson as well. In the tidyverse, the behavior you are looking for seems to be in the gather family.

I think the gather family of functions in tidyjson could do with a bit of improvement that would make these helpers unnecessary. Right now, they are very "type-sensitive" and error or throw out types that do not match. In any case, the workaround is not too challenging, although it definitely lacks elegance. Note that the bind_rows variant is presently from my development version and is not mainstream yet. Hopefully this illustrates the idea, though.

Notes on approach:

  • That all values would be numeric (I cast them to character afterwards)
  • Helpers gather elements of the varying types, and bind_rows stacks the datasets together.
  • level is kept track of by level of recursion

First define the helpers:

recurse_gather <- function(.x,.level) {
  .x <- tidyjson::bind_rows(
    gobj(.x,.level)
    , garr(.x,.level)
    , gpersist(.x,.level)
  )

  if (any(as.character(json_types(.x,'type')$type) %in% c('object','array'))) {
    .x <- recurse_gather(.x,.level+1)
  }

  return(.x)
}
gobj <- function(.x,.level) {
  .x %>% json_types('type') %>%
    filter(type=='object') %>%
    gather_object(paste0('v',.level)) %>%
    select(-type)
}

gpersist <- function(.x,.level) {
  .x %>% json_types('type') %>%
    filter(! type %in% c('object','array')) %>%
    mutate_(.dots=setNames(
      paste0('as.character(NA)')
      ,paste0('v',.level)
    )) %>%
    select(-type)
}

garr <- function(.x,.level) {
  .x %>% json_types('type') %>%
    filter(type=='array') %>%
    gather_array('arridx') %>%
    append_values_number(paste0('v',.level)) %>%
    mutate_(.dots=setNames(
      paste0('as.character(v',.level,')')
      ,paste0('v',.level)
    )) %>%
    select(-arridx,-type)
}

Then using the helpers is pretty straight-forward.

library(dplyr)
library(tidyjson)

j <- "{\"a\":[1],\"b\":[2],\"c\":{\"a\":[1,2,3,4,5],\"b\":[2],\"c\":{\"a\":[1],\"d\":[3],\"e\":[]}},\"d\":{\"y\":[3],\"z\":[2]}}"
recurse_gather(j, 1) %>% arrange(v1, v2, v3, v4) %>% tbl_df()
#> # A tibble: 12 x 5
#>    document.id    v1    v2    v3    v4
#>  *       <int> <chr> <chr> <chr> <chr>
#>  1           1     a     1  <NA>  <NA>
#>  2           1     b     2  <NA>  <NA>
#>  3           1     c     a     1  <NA>
#>  4           1     c     a     2  <NA>
#>  5           1     c     a     3  <NA>
#>  6           1     c     a     4  <NA>
#>  7           1     c     a     5  <NA>
#>  8           1     c     b     2  <NA>
#>  9           1     c     c     a     1
#> 10           1     c     c     d     3
#> 11           1     d     y     3  <NA>
#> 12           1     d     z     2  <NA>

Hopeful that future development on the tidyjson package will make this an easier problem to tackle!

like image 128
cole Avatar answered Jun 19 '26 10:06

cole



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!