Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to flatten a deeply and irregularly nested list/JSON in R

I'm trying to flatten a deeply/irregularly nested list/JSON object to a dataframe in R.

The key names are consistent, but the number of nested elements differ from one element to the next.

I've tried flattening the list using jsonlite and the tidyr::unnest function, but tidyr::unnest isn't able to unnest a list-column that contains multiple new columns. I've also tried using the map functions in the purrr package, but can't get anything to work.

A subset of the JSON data is below, and a list object is included at the end of this post.

[
  {
    "name": ["Hillary Clinton"],
    "type": ["PERSON"],
    "metadata": {
      "mid": ["/m/0d06m5"],
      "wikipedia_url": ["http://en.wikipedia.org/wiki/Hillary_Clinton"]
    },
    "salience": [0.2883],
    "mentions": [
      {
        "text": {
          "content": ["Clinton"],
          "beginOffset": [132]
        },
        "type": ["PROPER"]
      },
      {
        "text": {
          "content": ["Mrs."],
          "beginOffset": [127]
        },
        "type": ["COMMON"]
      },
      {
        "text": {
          "content": ["Clinton"],
          "beginOffset": [403]
        },
        "type": ["PROPER"]
      },
      {
        "text": {
          "content": ["Mrs."],
          "beginOffset": [398]
        },
        "type": ["COMMON"]
      },
      {
        "text": {
          "content": ["Hillary Clinton"],
          "beginOffset": [430]
        },
        "type": ["PROPER"]
      }
    ]
  },
  {
    "name": ["Trump"],
    "type": ["PERSON"],
    "metadata": {
      "mid": ["/m/0cqt90"],
      "wikipedia_url": ["http://en.wikipedia.org/wiki/Donald_Trump"]
    },
    "salience": [0.245],
    "mentions": [
      {
        "text": {
          "content": ["Trump"],
          "beginOffset": [24]
        },
        "type": ["PROPER"]
      },
      {
        "text": {
          "content": ["Mr."],
          "beginOffset": [20]
        },
        "type": ["COMMON"]
      }
    ]
  }
] 

And the desired output would be a dataframe like the one below, where the outer elements are repeated and each of the inner-most element have their own rows.

name                type        metadata.mid    metadata.wikipedia_url                          salience    mentions.text.content   mentions.text.beginOffset   mentions.type
Hillary Clinton     PERSON      /m/0d06m5       http://en.wikipedia.org/wiki/Hillary_Clinton    0.2883      Clinton                 132                         PROPER
Hillary Clinton     PERSON      /m/0d06m5       http://en.wikipedia.org/wiki/Hillary_Clinton    0.2883      Mrs.                    127                         COMMON
Hillary Clinton     PERSON      /m/0d06m5       http://en.wikipedia.org/wiki/Hillary_Clinton    0.2883      Clinton                 403                         PROPER
Hillary Clinton     PERSON      /m/0d06m5       http://en.wikipedia.org/wiki/Hillary_Clinton    0.2883      Mrs.                    398                         COMMON
Hillary Clinton     PERSON      /m/0d06m5       http://en.wikipedia.org/wiki/Hillary_Clinton    0.2883      Hillary Clinton         430                         PROPER
Trump               PERSON      /m/0cqt90       http://en.wikipedia.org/wiki/Donald_Trump       0.245       Trump                   24                          PROPER
Trump               PERSON      /m/0cqt90       http://en.wikipedia.org/wiki/Donald_Trump       0.245       Mr.                     20                          COMMON

Is there a general/scalable way to flatten this type of data?


An R list object:

nested_list <- list(structure(list(name = "Hillary Clinton", type = "PERSON", 
    metadata = structure(list(mid = "/m/0d06m5", wikipedia_url = "http://en.wikipedia.org/wiki/Hillary_Clinton"), .Names = c("mid", 
    "wikipedia_url")), salience = 0.28831193, mentions = list(
        structure(list(text = structure(list(content = "Clinton", 
            beginOffset = 132L), .Names = c("content", "beginOffset"
        )), type = "PROPER"), .Names = c("text", "type")), structure(list(
            text = structure(list(content = "Mrs.", beginOffset = 127L), .Names = c("content", 
            "beginOffset")), type = "COMMON"), .Names = c("text", 
        "type")), structure(list(text = structure(list(content = "Clinton", 
            beginOffset = 403L), .Names = c("content", "beginOffset"
        )), type = "PROPER"), .Names = c("text", "type")), structure(list(
            text = structure(list(content = "Mrs.", beginOffset = 398L), .Names = c("content", 
            "beginOffset")), type = "COMMON"), .Names = c("text", 
        "type")), structure(list(text = structure(list(content = "Hillary Clinton", 
            beginOffset = 430L), .Names = c("content", "beginOffset"
        )), type = "PROPER"), .Names = c("text", "type")))), .Names = c("name", 
"type", "metadata", "salience", "mentions")), structure(list(
    name = "Trump", type = "PERSON", metadata = structure(list(
        mid = "/m/0cqt90", wikipedia_url = "http://en.wikipedia.org/wiki/Donald_Trump"), .Names = c("mid", 
    "wikipedia_url")), salience = 0.24501903, mentions = list(
        structure(list(text = structure(list(content = "Trump", 
            beginOffset = 24L), .Names = c("content", "beginOffset"
        )), type = "PROPER"), .Names = c("text", "type")), structure(list(
            text = structure(list(content = "Mr.", beginOffset = 20L), .Names = c("content", 
            "beginOffset")), type = "COMMON"), .Names = c("text", 
        "type")))), .Names = c("name", "type", "metadata", "salience", 
"mentions")))
like image 210
Brian Avatar asked Dec 12 '25 05:12

Brian


1 Answers

One approach:

map_df(nested_list, function(x) {

  df <- flatten_df(x[c("name", "type", "metadata", "salience")]) 

  map_df(x$mentions, ~c(as.list(.$text), mentions_type=.$type)) %>%
    mutate(name=df$name, type=df$type, mid=df$mid,
           wikipedia_url=df$wikipedia_url, salience=df$salience)

}) %>% glimpse()
## Observations: 7
## Variables: 8
## $ content       <chr> "Clinton", "Mrs.", "Clinton", "Mrs.", "Hillary Clinton", "Trump", "Mr."
## $ beginOffset   <int> 132, 127, 403, 398, 430, 24, 20
## $ mentions_type <chr> "PROPER", "COMMON", "PROPER", "COMMON", "PROPER", "PROPER", "COMMON"
## $ name          <chr> "Hillary Clinton", "Hillary Clinton", "Hillary Clinton", "Hillary Clinton", "Hillary Clinton", "Trump", "Trump"
## $ type          <chr> "PERSON", "PERSON", "PERSON", "PERSON", "PERSON", "PERSON", "PERSON"
## $ mid           <chr> "/m/0d06m5", "/m/0d06m5", "/m/0d06m5", "/m/0d06m5", "/m/0d06m5", "/m/0cqt90", "/m/0cqt90"
## $ wikipedia_url <chr> "http://en.wikipedia.org/wiki/Hillary_Clinton", "http://en.wikipedia.org/wiki/Hillary_Clinton", "http://en.wikipedia.org/wiki/Hillary_Clinton", "http://en.wikiped...
## $ salience      <dbl> 0.2883119, 0.2883119, 0.2883119, 0.2883119, 0.2883119, 0.2450190, 0.2450190
like image 58
hrbrmstr Avatar answered Dec 13 '25 23:12

hrbrmstr