Efficient sampling from nested lists

Tags:

performance

r

lapply

nested

I have a list of lists containing data.frames, from which I want to select only a few rows. I can achieve this in a for-loop, where I create a sequence based on the number of rows and select only the row indices in that sequence.

But with more deeply nested lists it no longer works. I am also sure that there is a better way of doing this without a loop.

What would be an efficient and generic approach to sampling from nested lists that vary in their dimensions and contain data.frames or matrices?

## Dummy Data
n1=100;n2=300;n3=100
crdOrig <- list(
  list(data.frame(x = runif(n1,10,20), y = runif(n1,40,60))),
  list(data.frame(x = runif(n2,10,20), y = runif(n2,40,60))),
  list(data.frame(x = runif(n3,10,20), y = runif(n3,40,60)))
)

## Code to optimize
FiltRef <- list()
filterBy <- 10
for (r in seq_along(crdOrig)) {
  tmp <- do.call(rbind, crdOrig[[r]])
  filterInd <- seq(1, nrow(tmp), by = filterBy)
  FiltRef[[r]] <- tmp[filterInd, ]
}
crdResult <- do.call(rbind, FiltRef)

# Plotting
crdOrigPl <- do.call(rbind, unlist(crdOrig, recursive = FALSE))
plot(crdOrigPl[,1], crdOrigPl[,2], col="red", pch=20)
points(crdResult[,1], crdResult[,2], col="green", pch=20)

The code above also works if a list contains several data.frames (data below).

## Dummy Data (Multiple DF)
crdOrig <- list(
  list(data.frame(x = runif(n1,10,20), y = runif(n1,40,60)),
       data.frame(x = runif(n1,10,20), y = runif(n1,40,60))),
  list(data.frame(x = runif(n2,10,20), y = runif(n2,40,60))),
  list(data.frame(x = runif(n3,10,20), y = runif(n3,40,60)))
)

But if a list contains further lists, it throws an error when trying to bind the result (FiltRef) together.

The result can be a data.frame with 2 columns (x, y), like crdResult, or a one-dimensional list like FiltRef (from the first example).

## Dummy Data (Multiple Lists)
crdOrig <- list(
  list(list(data.frame(x = runif(n1,10,20), y = runif(n1,40,60))),
       list(data.frame(x = runif(n1,10,20), y = runif(n1,40,60)))),
  list(data.frame(x = runif(n2,10,20), y = runif(n2,40,60))),
  list(data.frame(x = runif(n3,10,20), y = runif(n3,40,60)))
)

+1 and Thank you all for your brilliant answers! They all work and there is a lot to learn from each one of them. I will give this one to @Gwang-Jin Kim as his solution is the most flexible and extensive, although they all deserve to be checked!

asked Jun 03 '18 16:06

SeGa

2 Answers

I would just flatten the whole darn thing and work on a clean list.

library(rlist)
out <- list.flatten(crdOrig)  # crdOrig: the nested list from the question

# prepare a vector for which columns belong together
vc <- rep(1:(length(out)/2), each = 2)
vc <- split(1:length(vc), vc)

# prepare the final list
ll <- vector("list", length(unique(vc)))
for (i in 1:length(vc)) {
  ll[[i]] <- as.data.frame(out[vc[[i]]])
}

result <- lapply(ll, FUN = function(x) {
  x[sample(1:nrow(x), size = 10, replace = FALSE), ]
})

do.call(rbind, result)

           x        y
98  10.32912 52.87113
52  16.42912 46.07026
92  18.85397 46.26403
90  12.04884 57.79290
23  18.20997 40.57904
27  18.98340 52.55919
...

answered Oct 31 '22 15:10

Roman Luštrik

Preparation and implementation of flatten

There are many other answers that do, in principle, the same thing.

Meanwhile, I implemented the flattening of nested lists myself, for fun.

Since I think in Lisp, I first implemented Lisp's car and cdr.

car <- function(l) {
  if(is.list(l)) {
    if (null(l)) {
      list()
    } else {
      l[[1]]
    }
  } else {
    stop("Not a list.")
  }
}

cdr <- function(l) {
  if (is.list(l)) {
    if (null(l) || length(l) == 1) {
      list()
    } else {
      l[2:length(l)]
    }
  } else {
    stop("Not a list.")
  }
}

A predicate function:

null <- function(l) length(l) == 0   
# this is Lisp's `null` checking whether list is empty (`length(l) == 0`)
# R's `is.null()` checks for the value NULL and not `length(obj) == 0`

# upon @Martin Morgan's comment removed other predicate functions
# thank you @Martin Morgan!
# instead using `is.data.frame()` and `is.list()`, since they are
# not only already there but also safer.

These are necessary to build flatten (for lists of data frames):

flatten <- function(nested.list.construct) {
  # Lisp-style flatten, implemented recursively with an accumulator (`..flatten()`).
  # Instead of Lisp's (atom l), the leaf test here is is.data.frame(l).
  ..flatten <- function(l, acc.l) {
    if (null(l)) {
      acc.l
    } else if (is.data.frame(l)) {   # originally one checks here for is.atom(l)
      c(list(l), acc.l)              # prepend, like (cons l acc.l); keeps the original order
    } else {
      ..flatten(car(l), ..flatten(cdr(l), acc.l))
    }
  }
  ..flatten(nested.list.construct, list())
}

# an atom is, in the widest sense, a non-list object
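For comparison, the same idea can be sketched in a few lines of base R without the Lisp helpers. This is only an illustration under assumptions, not the code above: `flatten_tables` is a hypothetical name, and treating matrices as leaves is an addition made because the question mentions them.

```r
# Hypothetical helper (not from the answer above): recursively collect every
# data frame or matrix in a nested list into one flat list, preserving order.
flatten_tables <- function(x) {
  if (is.data.frame(x) || is.matrix(x)) {
    return(list(x))          # a leaf: wrap it so c() does not split it into columns
  }
  if (is.list(x)) {
    # flatten each child, then concatenate the resulting flat lists;
    # the leading empty list keeps the result a list even when x is empty
    return(do.call(c, c(list(list()), lapply(x, flatten_tables))))
  }
  list()                     # any other object is ignored
}

nested <- list(
  list(list(data.frame(x = 1:3, y = 4:6)),
       list(data.frame(x = 7:8, y = 9:10))),
  list(matrix(1:4, nrow = 2))
)

flat <- flatten_tables(nested)
length(flat)                     # 3 leaves found
vapply(flat, nrow, integer(1))   # 3 2 2, in the original order
```

The data frame test has to come before the `is.list()` test, because a data frame is itself a list and would otherwise be split into its columns.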

With flatten in place, the actual collector function can be defined on top of a sampling function.

Defining sampling function

# no helper needed: base R's nrow() already returns dim(df)[1L]

# sampling function
sample.one.nth.of.rows <- function(df, fraction = 1/10) {
  # Randomly selects a fraction of the rows of a data frame
  nr <- nrow(df) 
  df[sample(nr, fraction * nr), , drop = FALSE]
}
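One detail worth noting: `fraction * nr` is usually not a whole number, and for a data frame with fewer than ten rows the sample size rounds down to zero. A possible variant that makes the rounding explicit and keeps at least one row could look like the sketch below. `sample_fraction` and its `min_rows` argument are hypothetical names, not part of the answer.

```r
# Hypothetical variant of the sampler above: explicit rounding, and at least
# `min_rows` rows are kept whenever the data frame has that many.
sample_fraction <- function(df, fraction = 1/10, min_rows = 1L) {
  n <- nrow(df)
  if (n == 0L) return(df)                    # nothing to sample from
  k <- max(min_rows, floor(fraction * n))    # round down, but not below min_rows
  k <- min(n, k)                             # never ask for more rows than exist
  df[sample.int(n, k), , drop = FALSE]
}

set.seed(1)                                  # for a reproducible illustration only
small <- data.frame(x = runif(5),   y = runif(5))
big   <- data.frame(x = runif(100), y = runif(100))

nrow(sample_fraction(small))   # 1 row instead of 0
nrow(sample_fraction(big))     # 10 rows
```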

The actual collector function (from nested data-frame-lists)

collect.df.samples <- function(df.list.construct, fraction = 1/10) {
  do.call(rbind, 
         lapply(flatten(df.list.construct), 
                function(df) sample.one.nth.of.rows(df, fraction)
               )
        )
}
# thanks for the improvement with `do.call(rbind, [list])` @Ryan!
# and the hint that `require(data.table)`
# `data.table::rbindlist([list])` would be even faster.
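The `rbindlist` hint from the comment above might look like the following sketch. It assumes the data.table package may or may not be installed and falls back to `do.call(rbind, ...)` otherwise; the `sampled` list is stand-in data, and the `source` column name is chosen only for illustration.

```r
# A flat list of already-sampled data frames (stand-in data for illustration)
sampled <- list(
  data.frame(x = runif(10, 10, 20), y = runif(10, 40, 60)),
  data.frame(x = runif(30, 10, 20), y = runif(30, 40, 60))
)

if (requireNamespace("data.table", quietly = TRUE)) {
  # rbindlist() binds all list elements in one call;
  # idcol adds a column recording which list element each row came from
  combined <- data.table::rbindlist(sampled, idcol = "source")
} else {
  # base R fallback: same rows, without the source column
  combined <- do.call(rbind, sampled)
}

nrow(combined)   # 40 rows either way
```

Note that `rbindlist` returns a data.table rather than a plain data.frame, so downstream code should be prepared for either class.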

collect.df.samples first flattens the nested list construct df.list.construct into a flat list of data frames. It then applies sample.one.nth.of.rows to each element of that list (lapply), producing a list of sampled data frames, each containing the given fraction (here 1/10) of the original rows. Finally, these sampled data frames are combined with rbind, and the resulting data frame, which consists of the sampled rows of every data frame, is returned.

Testing on example

## Dummy Data (Multiple Lists)
n1=100;n2=300;n3=100
crdOrig <- list(
  list(list(data.frame(x = runif(n1,10,20), y = runif(n1,40,60))),
       list(data.frame(x = runif(n1,10,20), y = runif(n1,40,60)))),
  list(data.frame(x = runif(n2,10,20), y = runif(n2,40,60))),
  list(data.frame(x = runif(n3,10,20), y = runif(n3,40,60)))
)

collect.df.samples(crdOrig, fraction = 1/10)

Refactoring for later modifications

collect.df.samples can be rewritten as:

# sampler function
sample.10th.fraction <- function(df) sample.one.nth.of.rows(df, fraction = 1/10)

# refactored:
collect.df.samples <- 
  function(df.list.construct, 
           df.sampler.fun = sample.10th.fraction) {
  do.call(rbind, 
          lapply(flatten(df.list.construct), df.sampler.fun))
}

This makes the sampler function replaceable. (Without this refactoring, changing the fraction parameter still lets one increase or reduce the number of rows collected from each data frame.)

The sampler function is easily exchangeable in this definition

To choose every nth (e.g. every 10th) row of a data frame instead of sampling randomly, you could, for example, use this as the body of the sampler function:

df[seq(from=1, to=nrow(df), by = nth), , drop = FALSE]

and pass it as df.sampler.fun = to collect.df.samples. This function is then applied to every data frame in the nested list, and the results are collected into one data frame.

every.10th.rows <- function(df, nth = 10) {
  df[seq(from=1, to=nrow(df), by = nth), , drop = FALSE]
}

a.10th.of.all.rows <- function(df, fraction = 1/10) {
  sample.one.nth.of.rows(df, fraction)
}

collect.df.samples(crdOrig, a.10th.of.all.rows)
collect.df.samples(crdOrig, every.10th.rows)
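Since the collector accepts any function of a single data frame, samplers with other parameters can also be generated by a small closure instead of being written out one by one. The following is a minimal sketch under that assumption: `make.nth.row.sampler` is a hypothetical name, and the sampler is applied to a flat list with `lapply` here so that the example runs on its own.

```r
# Hypothetical factory: returns a sampler that keeps every `nth` row,
# starting at row `from`. It also works for data frames with zero rows.
make.nth.row.sampler <- function(nth, from = 1L) {
  function(df) {
    idx <- seq_len(nrow(df))
    df[idx >= from & (idx - from) %% nth == 0L, , drop = FALSE]
  }
}

every.25th.row <- make.nth.row.sampler(25)

dfs <- list(data.frame(x = 1:100, y = 101:200),
            data.frame(x = 1:30,  y = 31:60))

# Same shape as the collector above: sample each data frame, then bind
result <- do.call(rbind, lapply(dfs, every.25th.row))
result$x   # 1 26 51 76 from the first data frame, then 1 26 from the second
```

With the definitions from this answer loaded, the generated sampler could be passed in the same way as the others, e.g. `collect.df.samples(crdOrig, every.25th.row)`.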

answered Oct 31 '22 15:10

Gwang-Jin Kim
