Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to effectively flatten nested lists and dataframes into a single dataframe?

Tags:

r

I have some data that is formatted in a way that's difficult to use, so I'm trying to flatten it out. The minimum reproducible example is here.

> str(sampleData)
List of 4
 $ Events       :'data.frame':  2 obs. of  3 variables:
  ..$ CateringOptions:List of 2
  .. ..$ :'data.frame': 1 obs. of  3 variables:
  .. .. ..$ Agreed : logi TRUE
  .. .. ..$ Tnc    :'data.frame': 1 obs. of  5 variables:
  .. .. .. ..$ Identity      : chr "SpicyOWing"
  .. .. .. ..$ Schema        : logi NA
  .. .. .. ..$ ElementId     : chr "105031"
  .. .. .. ..$ ElementType   : logi NA
  .. .. .. ..$ ElementVersion: logi NA
  .. .. ..$ Address: chr "New York"
  .. ..$ :'data.frame': 1 obs. of  3 variables:
  .. .. ..$ Agreed : logi TRUE
  .. .. ..$ Tnc    :'data.frame': 1 obs. of  5 variables:
  .. .. .. ..$ Identity      : chr "BaconEggs"
  .. .. .. ..$ Schema        : logi NA
  .. .. .. ..$ ElementId     : chr "105032"
  .. .. .. ..$ ElementType   : logi NA
  .. .. .. ..$ ElementVersion: logi NA
  .. .. ..$ Address: chr "Seattle"
  ..$ Action         : num [1:2] 1 1
  ..$ Volume         : num [1:2] 1000 2000
 $ Host         :List of 5
  ..$ Identity      : chr "John"
  ..$ Schema        : logi NA
  ..$ ElementId     : chr "101505"
  ..$ ElementType   : logi NA
  ..$ ElementVersion: logi NA
 $ Sender       :List of 5
  ..$ Identity      : chr "Jane"
  ..$ Schema        : logi NA
  ..$ ElementId     : chr "101005"
  ..$ ElementType   : logi NA
  ..$ ElementVersion: logi NA
 $ CompletedDate: chr "/Date(1490112000000)/"

Expected

> expectedOutcome
  Events.CateringOptions.Agreed Events.CateringOptions.Tnc.Identity Events.CateringOptions.Tnc.Schema Events.CateringOptions.Tnc.ElementId
1                            NA                          SpicyOWing                                TRUE                               105031
2                            NA                           BaconEggs                                TRUE                               105032
  Events.CateringOptions.Tnc.ElementType Events.CateringOptions.Tnc.ElementVersion Events.CateringOptions.Address Events.Action Events.Volume Host.Identity
1                                     NA                                        NA                       New York             1          1000          John
2                                     NA                                        NA                        Seattle             1          2000          John
  Host.Schema Host.ElementId Host.ElementType Host.ElementVersion Sender.Identity Sender.Schema Sender.ElementId Sender.ElementType Sender.ElementVersion
1          NA         101505               NA                  NA            Jane            NA           101005                 NA                    NA
2          NA         101505               NA                  NA            Jane            NA           101005                 NA                    NA
          CompletedDate
1 /Date(1490112000000)/
2 /Date(1490112000000)/

The check function

check<-function(li){
  areDF<-sapply(1:length(li), function(i) class(li[[i]]) == "data.frame")
  areList<-sapply(1:length(li), function(i) class(li[[i]]) == "list")
  tmp1 <- NULL
  tmp2 <- NULL
  if(any(areDF)){
    for(j in which(areDF)){
      columns <- jsonlite::flatten(li[[j]])
      li[[j]]  <- check(columns)
    }
    tmp1<-plyr::rbind.fill(li[areDF])
    #return(tmp1)
  }
  if(any(areList)){
    for(j in which(areList)){
      li[[j]]<-check(li[[j]])
    }
    tmp2<-do.call(cbind,li)
    #return(tmp2)
  }
  if(!is.null(tmp1) & !is.null(tmp2)){
    return (cbind(tmp1,tmp2))
  }
  else if(!is.null(tmp1)){
    return (tmp1)
  }
  else if(!is.null(tmp2)){
    return (tmp2)
  }
  return(li)
}

Results

> str(check(sampleData))
'data.frame': 2 obs. of  29 variables:
 $ CateringOptions.Agreed                   : logi  TRUE TRUE
 $ CateringOptions.Address                  : chr  "New York" "Seattle"
 $ CateringOptions.Tnc.Identity             : chr  "SpicyOWing" "BaconEggs"
 $ CateringOptions.Tnc.Schema               : logi  NA NA
 $ CateringOptions.Tnc.ElementId            : chr  "105031" "105032"
 $ CateringOptions.Tnc.ElementType          : logi  NA NA
 $ CateringOptions.Tnc.ElementVersion       : logi  NA NA
 $ Action                                   : num  1 1
 $ Volume                                   : num  1000 2000
 $ Events.CateringOptions.Agreed            : logi  TRUE TRUE
 $ Events.CateringOptions.Address           : chr  "New York" "Seattle"
 $ Events.CateringOptions.Tnc.Identity      : chr  "SpicyOWing" "BaconEggs"
 $ Events.CateringOptions.Tnc.Schema        : logi  NA NA
 $ Events.CateringOptions.Tnc.ElementId     : chr  "105031" "105032"
 $ Events.CateringOptions.Tnc.ElementType   : logi  NA NA
 $ Events.CateringOptions.Tnc.ElementVersion: logi  NA NA
 $ Events.Action                            : num  1 1
 $ Events.Volume                            : num  1000 2000
 $ Host.Identity                            : Factor w/ 1 level "John": 1 1
 $ Host.Schema                              : logi  NA NA
 $ Host.ElementId                           : Factor w/ 1 level "101505": 1 1
 $ Host.ElementType                         : logi  NA NA
 $ Host.ElementVersion                      : logi  NA NA
 $ Sender.Identity                          : Factor w/ 1 level "Jane": 1 1
 $ Sender.Schema                            : logi  NA NA
 $ Sender.ElementId                         : Factor w/ 1 level "101005": 1 1
 $ Sender.ElementType                       : logi  NA NA
 $ Sender.ElementVersion                    : logi  NA NA
 $ CompletedDate                            : Factor w/ 1 level "/Date(1490112000000)/": 1 1

I almost have it, but the nested dataframe is being duped. Also, my code takes fairly long. Does anyone have any idea how I can go about flattening this?

Edit:

I added my solution in the end in the gist

like image 567
Jean Avatar asked May 22 '26 08:05

Jean


1 Answers

Here is my take at it, with help from purrr.
The idea is similar to yours, only with a different syntax: flatten() the most nested dataframes, then rbind() them.
If I understand your code properly, mine is slightly different at the end, since I'll try to get a more "jsonlite::flatten-friendly" structure to apply it once more to the end result:

library(jsonlite)
library(purrr)
res <- 
  sampleData %>% 
  modify_if(
    is.list, 
    .f = ~ modify_if(
      .x,
      .p = function(x) all(sapply(x, is.data.frame)), 
      .f = ~ do.call("rbind", lapply(.x, jsonlite::flatten))
    )
  ) %>% 
  as.data.frame() %>% 
  jsonlite::flatten()
str(res)

# 'data.frame': 2 obs. of  20 variables:
#  $ Events.Action                            : num  1 1
#  $ Events.Volume                            : num  1000 2000
#  $ Host.Identity                            : chr  "John" "John"
#  $ Host.Schema                              : logi  NA NA
#  $ Host.ElementId                           : chr  "101505" "101505"
#  $ Host.ElementType                         : logi  NA NA
#  $ Host.ElementVersion                      : logi  NA NA
#  $ Sender.Identity                          : chr  "Jane" "Jane"
#  $ Sender.Schema                            : logi  NA NA
#  $ Sender.ElementId                         : chr  "101005" "101005"
#  $ Sender.ElementType                       : logi  NA NA
#  $ Sender.ElementVersion                    : logi  NA NA
#  $ CompletedDate                            : chr  "/Date(1490112000000)/" "/Date(1490112000000)/"
#  $ Events.CateringOptions.Agreed            : logi  TRUE TRUE
#  $ Events.CateringOptions.Address           : chr  "New York" "Seattle"
#  $ Events.CateringOptions.Tnc.Identity      : chr  "SpicyOWing" "BaconEggs"
#  $ Events.CateringOptions.Tnc.Schema        : logi  NA NA
#  $ Events.CateringOptions.Tnc.ElementId     : chr  "105031" "105032"
#  $ Events.CateringOptions.Tnc.ElementType   : logi  NA NA
#  $ Events.CateringOptions.Tnc.ElementVersion: logi  NA NA

I've got one mismatch with your expectedOutcome but if I may, it might be on your side:

all.equal(expectedOutcome[sort(names(expectedOutcome))], res[sort(names(res))])
# [1] "Component “Events.CateringOptions.Agreed”: 'is.NA' value mismatch: 0 in current 2 in target"
like image 82
Aurèle Avatar answered May 23 '26 22:05

Aurèle



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!