flatten record based list/object into dataframe

Tags:

r

Edit: this question is outdated. The jsonlite package flattens automatically.
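For readers arriving now, here is a minimal sketch of what that looks like, assuming the jsonlite package is installed. The inline JSON string is made up for illustration and only mimics the shape of the GPS records described below; it is not the original data.

```r
# Hypothetical records: the second one has no "l" (location) object.
json <- '[{"id":"a","l":{"lo":4.89,"t":1340654780000}},{"id":"b"}]'

if (requireNamespace("jsonlite", quietly = TRUE)) {
  # flatten = TRUE expands the nested "l" object into columns l.lo and l.t;
  # records without "l" get NA in those columns.
  df <- jsonlite::fromJSON(json, flatten = TRUE)
  print(names(df))
  print(df)
}
```

Note that columns are still inferred from the data here, so a field that is absent from every record will not show up.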

I am dealing with online data streams that use a record-based encoding, usually JSON. The structure of the object (i.e. the names in the JSON) is known from the API documentation; however, values are mostly optional and not present in every record. Lists can contain further lists, and the structure is sometimes quite deep. Here is a fairly simple example of some GPS data: http://pastebin.com/raw.php?i=yz6z9t25. Note that in the lower rows the "l" object is missing because there was no GPS signal.

I am looking for an elegant way to flatten these objects into a dataframe. I am currently using something like this:

library(RJSONIO)
library(plyr)

obj <- fromJSON("http://pastebin.com/raw.php?i=yz6z9t25", simplifyWithNames=FALSE, simplify=FALSE)
flatdata <- lapply(obj$data, as.data.frame)
mydf <- rbind.fill(flatdata)

This does the job, but it is slow and somewhat error-prone. One problem with this approach is that I am not using my knowledge of the structure (the object names); instead, the structure is inferred from the data. This causes trouble when a certain property happens to be absent from every record: it then does not appear in the dataframe at all, rather than appearing as a column of NA values. This can lead to issues downstream. For example, I need to process the location timestamp:

mydf$l.t <- structure(mydf$l.t/1000, class="POSIXct")

However, this results in an error for a dataset in which the l$t object is never present. Furthermore, both as.data.frame and rbind.fill make things quite slow, even though the example dataset is a relatively small one. Any suggestions for a better implementation? A robust solution would always yield a dataframe with the same columns in the same order, where only the number of rows varies.
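To make the downstream problem concrete, here is a small sketch of a guard around the timestamp conversion. The helper name ms_to_posixct and the toy data frames are hypothetical, and the choice of UTC is an assumption; it only illustrates the "same columns regardless of the data" requirement, not a full solution.

```r
# Hypothetical helper: convert a millisecond-epoch column to POSIXct,
# creating the column as all-NA first if the data never contained it.
ms_to_posixct <- function(df, col, tz = "UTC") {
  if (!col %in% names(df)) {
    df[[col]] <- rep(NA_real_, nrow(df))
  }
  df[[col]] <- as.POSIXct(df[[col]] / 1000, origin = "1970-01-01", tz = tz)
  df
}

# One data set with a location timestamp, one where no record had GPS.
with_gps <- ms_to_posixct(data.frame(id = "a", l.t = 1340654780000), "l.t")
no_gps   <- ms_to_posixct(data.frame(id = c("b", "c")), "l.t")
str(with_gps)
str(no_gps)
```

Both results have an l.t column of class POSIXct, so later code can rely on it being there.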

Edit: below is a dataset with more metadata. It is larger and more deeply nested:

obj <- fromJSON("http://www.stat.ucla.edu/~jeroen/files/output.json", simplifyWithNames=FALSE, simplify=FALSE)


asked Jun 25 '12 20:06

Jeroen Ooms

3 Answers

Here's a solution that lets you take advantage of your prior knowledge of data field names and classes. Also, by avoiding the repeated calls to as.data.frame and the single call to plyr's rbind.fill() (both time-intensive), it runs about 60 times faster on your example data.

cols <- c("id", "ls", "ts", "l.lo","l.tz", "l.t", "l.ac", "l.la", "l.pr", "m")   
numcols <- c("l.lo", "l.t", "l.ac", "l.la")

## Flatten each top-level list element, converting it to a character vector.
x <- lapply(obj$data, unlist)
## Extract fields that might be present in each record (returning NA if absent).
y <- sapply(x, function(X) X[cols])
## Convert to a data.frame with columns of desired classes.
z <- as.data.frame(t(y), stringsAsFactors=FALSE)
z[numcols] <- lapply(numcols, function(X) as.numeric(as.character(z[[X]])))
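The same idea can be tried without network access on a couple of made-up records. This is only a sketch under stated assumptions: the records, the field list, and the never-occurring field "l.pr" are illustrative. It also sets the column names explicitly rather than relying on the names sapply() picks up from the first record, which matters if the first record lacks some fields.

```r
# Two hypothetical records; the second has no "l" (location) object.
records <- list(
  list(id = "a", ts = 1340654781000,
       l = list(lo = 4.89, la = 52.37, t = 1340654780000)),
  list(id = "b", ts = 1340654791000)
)

# Known field names; "l.pr" never occurs in these records.
cols    <- c("id", "ts", "l.lo", "l.la", "l.t", "l.pr")
numcols <- c("ts", "l.lo", "l.la", "l.t")

# unlist() flattens nested names to "l.lo" etc.; indexing by name
# returns NA for fields a record does not have.
y <- vapply(records, function(r) as.character(unlist(r)[cols]),
            character(length(cols)))
z <- as.data.frame(t(y), stringsAsFactors = FALSE)
names(z) <- cols
z[numcols] <- lapply(z[numcols], as.numeric)
str(z)
```

The result always has the columns listed in cols, in that order, including an all-NA l.pr column.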

Edit: To confirm that my approach gives results identical to those in the original question, I ran the following test. (Notice that in both cases I set stringsAsFactors=FALSE to avoid meaningless differences in orderings of the factor levels.)

flatdata <- lapply(obj$data, as.data.frame, stringsAsFactors=FALSE)
mydf <- rbind.fill(flatdata)
identical(z, mydf)
# [1] TRUE

Further Edit:

Just for the record, here's an alternate version of the above that in addition automatically:

finds names of all data fields
determines their class/type
coerces the columns of the final data.frame to the correct class

dat <- obj$data

## Find the names and classes of all fields
fields <- unlist(lapply(dat, function(X) rapply(X, class, how="unlist")))
fields <- fields[unique(names(fields))]
cols <- names(fields)

## Flatten each top-level list element, converting it to a character vector.
x <- lapply(dat, unlist)
## Extract fields that might be present in each record (returning NA if absent).
y <- sapply(x, function(X) X[cols])
## Convert to a data.frame with columns of desired classes.
z <- as.data.frame(t(y), stringsAsFactors=FALSE)

## Coerce columns of z (all currently character) back to their original type
z[] <- lapply(seq_along(fields), function(i) as(z[[cols[i]]], fields[i]))
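For a quick check of the type-detection step without downloading anything, here is a self-contained sketch on made-up records. The data and field names are hypothetical, and it assumes every leaf has a single class (which holds for plain JSON values); column names are again set explicitly.

```r
# Hypothetical records with character, integer, numeric and logical leaves.
dat <- list(
  list(id = "a", n = 1L, l = list(lo = 4.89, ok = TRUE)),
  list(id = "b", n = 2L)
)

# Class of every leaf, keyed by its flattened name (first occurrence wins).
fields <- unlist(lapply(dat, function(X) rapply(X, class, how = "unlist")))
fields <- fields[unique(names(fields))]
cols <- names(fields)

# Character matrix with one column per record, NA where a field is absent.
y <- vapply(dat, function(r) as.character(unlist(r)[cols]),
            character(length(cols)))
z <- as.data.frame(t(y), stringsAsFactors = FALSE)
names(z) <- cols

# Coerce each column back to the class recorded in `fields`.
z[] <- lapply(cols, function(nm) as(z[[nm]], fields[[nm]]))
str(z)
```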


answered Oct 25 '22 05:10

Josh O'Brien

Here's an attempt that tries to make no assumptions about the types of the data. It's a bit slower than @JoshOBrien's, but faster than the OP's original solution.

Joshua <- function(x) {
  un <- lapply(x, unlist, recursive=FALSE)
  ns <- unique(unlist(lapply(un, names)))
  un <- lapply(un, function(x) {
    y <- as.list(x)[ns]
    names(y) <- ns
    lapply(y, function(z) if(is.null(z)) NA else z)})
  s <- lapply(ns, function(x) sapply(un, "[[", x))
  names(s) <- ns
  data.frame(s, stringsAsFactors=FALSE)
}

Josh <- function(x) {
  cols <- c("id", "ls", "ts", "l.lo","l.tz", "l.t", "l.ac", "l.la", "l.pr", "m")   
  numcols <- c("l.lo", "l.t", "l.ac", "l.la")
  ## Flatten each top-level list element, converting it to a character vector.
  x <- lapply(x, unlist)
  ## Extract fields that might be present in each record (returning NA if absent).
  y <- sapply(x, function(X) X[cols])
  ## Convert to a data.frame with columns of desired classes.
  z <- as.data.frame(t(y))
  z[numcols] <- lapply(numcols, function(X) as.numeric(as.character(z[[X]])))
  z
}

Jeroen <- function(x) {
  flatdata <- lapply(x, as.data.frame)
  rbind.fill(flatdata)
}

library(rbenchmark)
benchmark(Josh=Josh(obj$data), Joshua=Joshua(obj$data),
  Jeroen=Jeroen(obj$data), replications=5, order="relative")
#     test replications elapsed  relative user.self sys.self user.child sys.child
# 1   Josh            5    0.24  1.000000      0.24        0         NA        NA
# 2 Joshua            5    0.31  1.291667      0.32        0         NA        NA
# 3 Jeroen            5   12.97 54.041667     12.87        0         NA        NA

answered Oct 25 '22 05:10

Joshua Ulrich

Just for clarity, I am adding a combination of Josh's and Joshua's solutions, which is the best I have come up with so far.

flatlist <- function(mylist){
    lapply(rapply(mylist, enquote, how="unlist"), eval)
}

records2df <- function(recordlist, columns) {
    if(length(recordlist)==0 && !missing(columns)){
      return(as.data.frame(matrix(ncol=length(columns), nrow=0, dimnames=list(NULL,columns))))
    }
    un <- lapply(recordlist, flatlist)
    if(!missing(columns)){
        ns <- columns
    } else {
        ns <- unique(unlist(lapply(un, names)))
    }
    un <- lapply(un, function(x) {
        y <- as.list(x)[ns]
        names(y) <- ns
        lapply(y, function(z) if(is.null(z)) NA else z)})
    s <- lapply(ns, function(x) sapply(un, "[[", x))
    names(s) <- ns
    data.frame(s, stringsAsFactors=FALSE)
}

The function is reasonably fast. I still think it should be possible to speed it up, though:

obj <- fromJSON("http://www.stat.ucla.edu/~jeroen/files/output.json", simplifyWithNames=FALSE, simplify=FALSE)
flatdata <- records2df(obj$data)

It also allows you to 'force' certain columns, although it doesn't result in too much of a speedup:

flatdata <- records2df(obj$data, columns=c("m", "doesnotexist"))
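To illustrate why the flattening step goes through enquote and eval rather than a plain unlist(), here is a small sketch on a single made-up record (the record itself is hypothetical):

```r
rec <- list(id = "a", l = list(lo = 4.89, ok = TRUE))

# unlist() coerces all leaves to a common type (character here).
as_chr <- unlist(rec)
str(as_chr)

# Quoting each leaf first lets unlist() flatten the nesting without
# coercing the values; eval() then recovers the original values.
flat <- lapply(rapply(rec, enquote, how = "unlist"), eval)
str(flat)
```

The flattened list keeps one element per leaf with names such as l.lo, and each element retains its original type.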

answered Oct 25 '22 06:10

Jeroen Ooms
