
How to avoid looping over list after reading from JSON with R

Tags: json, r

I have a vector of JSON data in R, and with lapply I extract the information:

 list <- lapply(temp, fromJSON)

The structure of the first element of this list looks like this:

str(list[[1]])

List of 4
 $ boundedBy :List of 2
  ..$ type       : chr "Polygon"
  ..$ coordinates:List of 1
  .. ..$ :List of 5
  .. .. ..$ : num [1:2] 89328 208707
  .. .. ..$ : num [1:2] 89333 208707
  .. .. ..$ : num [1:2] 89333 208713
  .. .. ..$ : num [1:2] 89328 208713
  .. .. ..$ : num [1:2] 89328 208707
 $ hnrlbl    : NULL
 $ opndatum  : chr "2011-05-30"
 $ oidn      : chr "2954841"

This works for the first element: list[[1]]$hnrlbl, but how do I do this for the whole list at once? Something like list[[.]]$hnrlbl

Kasper Van Lombeek asked Aug 26 '14 10:08


4 Answers

In this case you could just use list.map from the rlist package:

mylist <- lapply(temp, fromJSON)
library(rlist)
list.map(mylist, hnrlbl)

http://cran.r-project.org/web/packages/rlist/vignettes/Mapping.html

jdharrison answered Nov 14 '22 23:11


I have a helper function that's useful for these scenarios:

pluck <- function(x, name, type) {
  if (missing(type)) {
    lapply(x, .subset2, name)
  } else {
    vapply(x, .subset2, name, FUN.VALUE = type)
  }
}

(This was inspired by underscore and Winston Chang. .subset2() is an internal version of [[: it is faster but does not do S3 dispatch, which means that x needs to be a plain list.)
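To make the remark about S3 dispatch concrete, here is a minimal sketch. The class name "shouty" and its [[ method are hypothetical, invented only for this illustration; they are not part of the answer's code.

```r
# On a plain list, .subset2() and [[ return the same thing.
plain <- list(z = "ok")
same_on_plain <- identical(.subset2(plain, "z"), plain[["z"]])

# Hypothetical S3 class whose [[ method upper-cases the stored value.
wrapped <- structure(list(z = "ok"), class = "shouty")
`[[.shouty` <- function(x, i, ...) toupper(unclass(x)[[i]])

via_method <- wrapped[["z"]]           # dispatches to [[.shouty
via_subset2 <- .subset2(wrapped, "z")  # bypasses the method, raw value
```

So for objects that define their own [[ method, a helper built on .subset2() would silently return the underlying stored value rather than what the method computes.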

With this function, solving your problem is easy:

x <- list(
  a = list(x = rnorm(10), y = letters[1:10], z = "OK"),
  b = list(x = rnorm(10), y = letters[11:20], z = "notOK")
)

# List of results
str(pluck(x, "z"))
#> List of 2
#>  $ a: chr "OK"
#>  $ b: chr "notOK"

# Vector of results
str(pluck(x, "z", character(1)))
#>  Named chr [1:2] "OK" "notOK"
#>  - attr(*, "names")= chr [1:2] "a" "b"

(You can also select by position: pluck(x, 2, character(10)))
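As a sketch of why the type argument matters: vapply() checks every extracted value against the template, so a mismatch fails loudly instead of silently changing the shape of the result. The example reuses the pluck() helper and the toy list from above; the variable names by_position and mismatch are only for illustration.

```r
pluck <- function(x, name, type) {
  if (missing(type)) {
    lapply(x, .subset2, name)
  } else {
    vapply(x, .subset2, name, FUN.VALUE = type)
  }
}

x <- list(
  a = list(x = rnorm(10), y = letters[1:10], z = "OK"),
  b = list(x = rnorm(10), y = letters[11:20], z = "notOK")
)

# Position 2 is the y element: ten strings per entry, so the
# result is a character matrix with one column per list entry.
by_position <- pluck(x, 2, character(10))

# Asking for a single string where each entry holds ten is an error.
mismatch <- tryCatch(
  pluck(x, "y", character(1)),
  error = function(e) "length mismatch caught"
)
```

Here dim(by_position) is 10 by 2, and mismatch holds the fallback string because vapply() refused the wrong-length values.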

Benchmarking

This method is also quite fast:

x_big <- rep(x, 1000)

myselect <- function(x,name){
  tmp <- unlist(x, recursive = FALSE)
  id <- grep(paste0("\\.",name,"$"), names(tmp))
  tmp[id]
}

library(microbenchmark)
options(digits = 2)
microbenchmark(
  sapply(x_big, function(i)i$z),
  myselect(x_big,"z"),
  pluck(x_big, "z", character(1))
)
#> Unit: microseconds
#>                             expr  min   lq median   uq  max neval
#>   sapply(x_big, function(i) i$z) 2771 2886   2972 3124 5903   100
#>             myselect(x_big, "z") 2250 2330   2366 2401 3551   100
#>  pluck(x_big, "z", character(1))  717  786    825  889 1731   100

hadley answered Nov 15 '22 00:11


After a couple of hours looking for the cleanest method, we settled on:

 kadaster_building_temp$hnrlbl <- sapply(list,function(x){x$hnrlbl} )
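One caveat worth sketching: in the question's data, hnrlbl is NULL for the first record, and sapply() cannot simplify to a vector when some results are NULL. The records below are made up for illustration (only the oidn "2954841" comes from the question), and the sketch assumes every non-NULL hnrlbl is a single string.

```r
# Hypothetical records shaped like the parsed JSON in the question.
records <- list(
  list(hnrlbl = NULL, oidn = "2954841"),
  list(hnrlbl = "12A", oidn = "2954842")
)

# With a NULL among the results, sapply() falls back to returning a list.
as_list <- sapply(records, function(x) x$hnrlbl)

# Mapping NULL to NA keeps one value per record, so the result
# can be assigned to a data frame column.
hnrlbl <- vapply(records, function(x) {
  value <- x[["hnrlbl"]]
  if (is.null(value)) NA_character_ else value
}, character(1))
```

With this approach hnrlbl is a character vector of the same length as records, with NA where the field was missing.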

Kasper Van Lombeek answered Nov 14 '22 23:11


Warning: because it relies on regular expressions, this solution can fail under some conditions (depending on the names used in your lists). If speed is not an issue, either list.map or the sapply solution is more robust.


You can gain quite some speed by using unlist() here and then looking up the names. Take the following function, myselect:

myselect <- function(x,name){
  tmp <- unlist(x,recursive=FALSE)
  id <- grep(paste0("(^|\\.)",name,"$"),names(tmp))
  tmp[id]
}

This does much the same thing, but in a vectorized way. With recursive=FALSE, unlist() flattens the nested list by one level, so all inner elements become part of a single list. unlist() names each flattened element by joining the outer and inner names with a dot, and myselect uses that convention to find every element with exactly the name you want; hence the call to paste0, which builds a regular expression that avoids partial name matches. Plain subsetting then returns a list with the wanted elements. If you want a vector instead, simply call unlist() on the result.
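To see the naming convention the function relies on, here is a small sketch with a toy list (the values are illustrative only):

```r
aList <- list(
  a = list(x = 1:3, y = letters[1:3], z = "OK"),
  b = list(x = 4:6, y = letters[4:6], z = "notOK")
)

# One level of flattening: each inner element survives intact and is
# named "<outer name>.<inner name>".
flat <- unlist(aList, recursive = FALSE)
flat_names <- names(flat)
# "a.x" "a.y" "a.z" "b.x" "b.y" "b.z"

# A full unlist() would instead break every vector into single values.
n_fully_flat <- length(unlist(aList))
# 14

# Select by name, then collapse the resulting list to a vector.
hits <- flat[grep("(^|\\.)z$", flat_names)]
z_values <- unlist(hits)
# named character vector: a.z = "OK", b.z = "notOK"
```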

Note that I presume you have a list of lists, so you only want to flatten one level. For more complicated nesting, this obviously won't work in the current form.
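Both limitations, deeper nesting and the name-dependent failures mentioned in the warning, can be reproduced with small made-up lists. The lists deep and tricky below are hypothetical examples, not data from the question.

```r
myselect <- function(x, name) {
  tmp <- unlist(x, recursive = FALSE)
  id <- grep(paste0("(^|\\.)", name, "$"), names(tmp))
  tmp[id]
}

# Deeper nesting: after one level of flattening the names are
# "a.inner" and "b.inner", so nothing matches "z".
deep <- list(
  a = list(inner = list(z = "OK")),
  b = list(inner = list(z = "notOK"))
)
n_deep <- length(myselect(deep, "z"))
# 0

# A dot inside an element name: "a.other.z" also ends in ".z",
# so it is picked up along with the two intended elements.
tricky <- list(
  a = list(z = "OK", other.z = "unwanted"),
  b = list(z = "notOK")
)
picked <- names(myselect(tricky, "z"))
# "a.z" "a.other.z" "b.z"

# Exact-name subsetting is not affected by the dotted name.
exact <- sapply(tricky, `[[`, "z")
```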


Example and Benchmarking

The speed gain obviously depends on the structure of the list, but it can be 50-fold or more.

Take the following (very basic) example:

aList <- list(
  a=list(x=rnorm(10),y=letters[1:10],z="OK"),
  b=list(x=rnorm(10),y=letters[11:20],z="notOK")
  )

Benchmarking this gives:

require(rbenchmark)
benchmark(
  sapply(aList,function(i)i$z),
  myselect(aList,"z"),
  columns=c("test","elapsed","relative"),
  replications=10000
  )

                            test elapsed relative
2           myselect(aList, "z")    0.24    1.000
1 sapply(aList, function(i) i$z)    0.39    1.625

With larger objects, the improvement can be substantial. Using this on a list I happened to have in my workspace (dput is not an option here...):

> benchmark(
+   sapply(StatN0_1,function(i)i$SP),
+   myselect(StatN0_1,"SP"),
+   columns=c("test","elapsed","relative"),
+   replications=100
+ )
                                test elapsed relative
2           myselect(StatN0_1, "SP")    0.02      1.0
1 sapply(StatN0_1, function(i) i$SP)    1.13     56.5

Joris Meys answered Nov 15 '22 00:11