Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Extracting from Nested list to data frame

Tags:

list

dataframe

r

I will put dput of what my list looks like at the bottom such that the q can be reproducible. The dput is of a not x.

I have a big nested list called x that I'm trying to build a data frame from but cannot figure it out.

I have done the first part:

for(i in 1:3){a[[i]]<-x$results[[i]]$experiences
indx <- lengths(a)
zz <- as.data.frame(do.call(rbind,lapply(a, `length<-`, max(indx))))}

For this I used the following answer: Converting nested list (unequal length) to data frame

This leaves me a data.frame with n columns for results where n is the max results for any i:

  v1   v2   v3
1 NULL NULL NULL
2  *    *    *
3 NULL NULL NULL

Each * is another nested list in the format list(experience = list(duration = ...

For example the first * in row 2, column v1. I don't want the total list. I only want:

a[[2]][[1]]$experience$start

or in terms of the original list x:

x$results[[2]]$experiences[[1]]$experience$start

I feel like I'm nearly there with some tweaks. I tried:

for(i in 1:3){a[[i]]<-x$results[[i]]$experiences
indx <- lengths(a)
for(y in 1:length(a[[i]])) aa <- rbind(aa,tryCatch(x$results[[i]]$experiences[[y]]$experience$start, error=function(e) print(NA)))
zz <- as.data.frame(do.call(rbind,lapply(aa, `length<-`, max(indx))))}

Resulting in:

  v1     v2     v3
1  NA     NA     NA
2  NA     NA     NA
3 2014    NA     NA
4 2012    NA     NA
5 2006    NA     NA
6  NA     NA     NA
7  NA     NA     NA 

Tried cbind instead of rbind on final line and that put all the dates in the first row.

I also tried the following:

for(i in 1:3){a[[i]]<-lengths(x$results[[i]]$experiences)
  indx <- lengths(a)
for(y in 1:length(indx)){tt[i] <- tryCatch(x$results[[i]]$experiences[[y]]$experience$start, error=function(e) print(""))}
zz <- as.data.frame(do.call(rbind,lapply(tt, `length<-`, max(indx))))}

This came close, builds the right format but only returns the first result:

  v1   v2  v3
1 NA   NA  NA
2 2014 NA  NA
3 NA   NA  NA

The format I want is:

 V1  V2  V3
1 NA  NA  NA
2 2014 2012 2006
3 NA  NA  NA

((Sample data now at bottom))

Newest attempt:

Doing the following but returns only the first start date from each a[[i]], the second loop I need to make the list aa[i][y] something different.

 for(i in 1:3){a[[i]]<-x$results[[i]]$experiences
 for(y in 1:length(a[[i]])){aa[i][y] = if(is.null(a[[i]][[y]]$experience$start)){"NULL"}else{a[[i]][[y]]$experience$start}}}

So for dput2 I'd like the form:

  v1    v2  v3   v4   v5   v6   v7   v8
1 2015
2 2011 2007 null null null null null null
3 2016 2015 2015 2015 2013 2010

I dont mind if the blanks are null or na

UPDATE

The below answer almost works, however in my data the structure changes, the order of the names (roleName, duration etc) change so that ruins the answer as cumsum is used to determine when a new list is found. If you have duration then start the keys are 9 and 1 and the cumsum part labels them two different lists.

I wrote the following:

my.list <- list(structure(
  list(
    experience = structure(
      list(
        start = "1",
        end = "1",
        roleName = "a",
        summary = "a",
        duration = "a",
        current = "a",
        org = structure(list(name = "a", url = "a"), .Names = c("name","url")),
        location = structure(
          list(
            displayLocation = NULL,
            lat = NULL,
            lng = NULL
          ),
          .Names = c("displayLocation",
                     "lat", "lng")
        ) ),.Names = c("start", "end", "roleName", "summary", "duration", "current", "org", "location")),
    `_meta` = structure(
      list(weight = 1L, `_sources` = list(structure(
        list(`_origin` = "a"), .Names = "_origin"
      ))),.Names = c("weight", "_sources"))),.Names = c("experience", "_meta")))

Then:

aa <- lapply(1:length(a), function(y){tryCatch(lapply(1:length(a[[y]]), 
                     function(i){a[[y]][[i]]$experience[names(my.list2[[1]]$experience)]}), error=function(e) print(list()))})

This changes the structure such that key2 will always be in the right order.

However Then I found after this loop I have another issue.

Sometimes I have for example nothing but a roleName in the experience list. If that occurs twice in a row the keys are repeated. cumsum treats them as the same experience instead of separate ones.

This means I cannot create df3 because of duplicate identifiers for rows. And even if I could by removing troublesome rows, the names wouldn't match as i in the solution below matches the names using the sequence, if I remove any rows that changes the lengths.

Here is my total code for more insight:

for(i in 1:x$count){a[[i]]<-x$results[[i]]$experiences}

  aa <- lapply(1:length(a), function(y){tryCatch(lapply(1:length(a[[y]]), 
                     function(i){a[[y]][[i]]$experience[names(my.list2[[1]]$experience)]}), error=function(e) print(list()))})

  aaa <- unlist(aa)
  dummydf <- data.frame(b=c("start", "end", "roleName", "summary", 
                            "duration", "current", "org.name",  "org.url"), key=1:8)

  df <- data.frame(a=aaa, b=names(aaa))
  df2 <- left_join(df, dummydf)
  df2$key2 <- as.factor(cumsum(df2$key < c(0, df2$key[-length(df2$key)])) +1)

  df_split <- split(df2, df2$key2)
  df3 <- lapply(df_split, function(x){
    x %>% select(-c(key, key2)) %>% spread(b, a)
  }) %>% data.table::rbindlist(fill=TRUE) %>% t
  df3 <- data.frame(df3)
  i <- sapply(seq_along(aa), function(y) rep(y, sapply(aa, function(x) length(x))[y])) %>% unlist
  names(df3) <- paste0(names(df3), "_", i)
  df4 <- data.frame(t(df3))
  df4$dates <- as.Date(NA)
  df4$dates <- as.Date(df4$start)
  df4 <- data.frame(dates = df4$dates)
  df4 <- t(df4)
  df4 <- data.frame(df4)
  names(df4) <- paste0(names(df4), "_", i)
  df4[] <- lapply(df4[], as.character)
  l1 <- lapply(split(stack(df4), sub('.*_', '', stack(df4)[,2])), '[', 1)
  df5 <- t(do.call(cbindPad, l1))
  df5 <- data.frame(df5)

cbindpad taken from this question

New sample code including the issues:

dput3 = 
list(list(), list(
structure(list(experience = structure(list(
  duration = "1", start = "2014", 
  end = "3000", roleName = "a", 
  summary = "aaa", 
  org = structure(list(name = "a"), .Names = "name"), 
  location = structure(list(displayLocation = NULL, lat = NULL, 
    lng = NULL), .Names = c("displayLocation", "lat", "lng"
    ))), .Names = c("duration", "start", "end", "roleName", "summary", 
    "org", "location")), `_meta` = structure(list(weight = 1L, `_sources` = list(
      structure(list(`_origin` = ""), .Names = "_origin"))), .Names = c("weight", 
      "_sources"))), .Names = c("experience", "_meta")), 
structure(list(
        experience = structure(list(end = "3000", 
        start = "2012", duration = "2", 
        roleName = "a", summary = "aaa", 
        org = structure(list(name = "None"), .Names = "name"), 
        location = structure(list(displayLocation = NULL, lat = NULL, lng = NULL), .Names = c("displayLocation", "lat", "lng"))), .Names = c("duration", "start", "end", "roleName", 
        "summary", "org", "location")), `_meta` = structure(list(
          weight = 1L, `_sources` = list(structure(list(`_origin` = " "), .Names = "_origin"))), .Names = c("weight", "_sources"))), .Names = c("experience", "_meta")), 
  structure(list(
            experience = structure(list(duration = "3", 
            start = "2006", end = "3000", 
            roleName = "a", summary = "aaa", org = structure(list(name = " "), .Names = "name"), 
            location = structure(list(displayLocation = NULL, lat = NULL, lng = NULL), .Names = c("displayLocation", "lat", "lng"))), .Names = c("duration", "start", "end", "roleName",
            "summary", "org", "location")), `_meta` = structure(list(weight = 1L, `_sources` = list(structure(list(`_origin` = ""), .Names = "_origin"))), .Names = c("weight", 
            "_sources"))), .Names = c("experience", "_meta")),
  structure(list(
            experience = structure(list(roleName = "a",  
            location = structure(list(displayLocation = NULL, lat = NULL, lng = NULL), .Names = c("displayLocation", "lat", "lng"))), .Names = c("roleName", 
           "location")), `_meta` = structure(list(
            weight = 1L, `_sources` = list(structure(list(`_origin` = " "), .Names = "_origin"))), .Names = c("weight", "_sources"))), .Names = c("experience", "_meta")),
structure(list(
            experience = structure(list(roleName = "a",  
            location = structure(list(displayLocation = NULL, lat = NULL, lng = NULL), .Names = c("displayLocation", "lat", "lng"))), .Names = c("roleName", 
            "location")), `_meta` = structure(list(
            weight = 1L, `_sources` = list(structure(list(`_origin` = " "), .Names = "_origin"))), .Names = c("weight", "_sources"))), .Names = c("experience", "_meta"))
            ), 
            list(
structure(list(experience = structure(list(
              duration = "1", start = "2014", 
              end = "3000", roleName = "a", 
              summary = "aaa", 
              org = structure(list(name = "a"), .Names = "name"), 
              location = structure(list(displayLocation = NULL, lat = NULL, 
                lng = NULL), .Names = c("displayLocation", "lat", "lng"
                ))), .Names = c("duration", "start", "end", "roleName", "summary", 
                "org", "location")), `_meta` = structure(list(weight = 1L, `_sources` = list(
                  structure(list(`_origin` = ""), .Names = "_origin"))), .Names = c("weight", 
                  "_sources"))), .Names = c("experience", "_meta"))))
like image 262
Olivia Avatar asked Nov 09 '22 03:11

Olivia


1 Answers

Maybe this can help

library(dplyr)
library(tidyr)

a <- unlist(a)

df <- data.frame(a=a, b=names(a)) %>% mutate(key=cumsum(b=="experience.duration")) %>% 
      split(.$key) %>% lapply(function(x) x %>% select(-key) %>% spread(b, a)) %>% 
      do.call(rbind, .) %>% t %>% data.frame

df$key <- rownames(df)

Then you can filter in on the rows of interest

The above would be equivalent to

rbind(unlist(a)[1:8], unlist(a)[9:16],unlist(a)[17:24]) %>% t

Update

try this for dput2

a <- unlist(dput2)

library(dplyr)
library(tidyr)

dummydf <- data.frame(b=c("experience.start", "experience.end", "experience.roleName", "experience.summary", 
                      "experience.org", "experience.org.name",  "experience.org.url", 
                      "_meta.weight", "_meta._sources._origin", "experience.duration"), key=1:10)


df <- data.frame(a=a, b=names(a))

df2 <- left_join(df, dummydf)
df2$key2 <- as.factor(cumsum(df2$key < c(0, df2$key[-length(df2$key)])) +1)
df_split <- split(df2, df2$key2)
df3 <- lapply(df_split, function(x){
       x %>% select(-c(key, key2)) %>% spread(b, a)
       }) %>% data.table::rbindlist(fill=TRUE) %>% t

df3 <- data.frame(df3)
i <- sapply(seq_along(dput2), function(y) rep(y, sapply(dput2, function(x) length(x))[y])) %>% unlist
names(df3) <- paste0(names(df3), "_", i)

View(df3)
like image 181
dimitris_ps Avatar answered Nov 15 '22 08:11

dimitris_ps