Any speedier way to randomly subset vectors inside a list?

Tags:

list

r

subset

I'm looking for a speedy solution for randomly subsetting vectors nested in a list.

If we simulate the following data, we get a list l holding 3 million vectors, each of length 5. I want the length of each vector to vary, so I thought I would apply a function that randomly subsets each vector. The problem is that this method is not as fast as I had hoped.

Simulate data: the list l

library(stringi)

set.seed(123)
vec_n <- 15e6
vec_vals  <- 1:vec_n
vec_names <- stringi::stri_rand_strings(vec_n, 5)

my_named_vec <- setNames(vec_vals, vec_names)

split_func <- function(x, n) {
  unname(split(x, rep_len(1:n, length(x))))
}

l <- split_func(my_named_vec, n = vec_n / 5)

head(l)
#> [[1]]
#>    HmPsw    Qk8NP    Quo3T    8f0GH    nZmjN 
#>        1  3000001  6000001  9000001 12000001 
#> 
#> [[2]]
#>    2WtYS    ZaHFl    6YjId    jbGuA    tAG65 
#>        2  3000002  6000002  9000002 12000002 
#> 
#> [[3]]
#>    xSgZ6    jM5Uw    ujPOc    CTV5F    5JRT5 
#>        3  3000003  6000003  9000003 12000003 
#> 
#> [[4]]
#>    tF2Kx    r4ZCI    Ooklo    VOLHU    M6z6H 
#>        4  3000004  6000004  9000004 12000004 
#> 
#> [[5]]
#>    tgdze    w8d1B    FYERK    jlClo    NQfsF 
#>        5  3000005  6000005  9000005 12000005 
#> 
#> [[6]]
#>    hXaH9    gsY1u    CjBwC    Oqqty    dxJ4c 
#>        6  3000006  6000006  9000006 12000006

Now that we have l, I wish to subset each vector randomly, meaning that the number of elements kept per vector is itself random. One option is to define the following utility function:

randomly_subset_vec <- function(x) {
  my_range <- 1:length(x)
  x[-sample(my_range, sample(my_range))]
}

lapply(head(l), randomly_subset_vec)
#> [[1]]
#>   Quo3T 
#> 6000001 
#> 
#> [[2]]
#>   6YjId   jbGuA 
#> 6000002 9000002 
#> 
#> [[3]]
#>   xSgZ6   jM5Uw   ujPOc   CTV5F 
#>       3 3000003 6000003 9000003 
#> 
#> [[4]]
#>   Ooklo 
#> 6000004 
#> 
#> [[5]]
#> named integer(0)
#> 
#> [[6]]
#>    CjBwC    Oqqty    dxJ4c 
#>  6000006  9000006 12000006

But running this procedure over the entire l takes forever. I've tried rrapply, a fast package for working with lists, and it takes "only" 110 seconds on my machine.

library(rrapply)
library(tictoc)

tic()
l_subsetted <- rrapply(object = l, f = randomly_subset_vec)
toc()
#> 110.23 sec elapsed

I will be happy with either of the following:

1. Is there a speedier alternative to:

rrapply(object = l, f = randomly_subset_vec)

2. Or more generally, is there a speedier way to start with my_named_vec and arrive at l_subsetted?

asked Nov 01 '21 14:11

Emman

4 Answers

UPDATE 1 to fix the name behavior in stack for large objects

Your subsets never include the full vector, so this first removes one random element from each vector, then keeps each of the remaining elements with probability 1/2:

library(stringi)

set.seed(123)
vec_n <- 15e6
vec_vals  <- 1:vec_n
vec_names <- stringi::stri_rand_strings(vec_n, 5)

my_named_vec <- setNames(vec_vals, vec_names)

split_func <- function(x, n) {
  unname(split(x, rep_len(1:n, length(x))))
}

l <- split_func(my_named_vec, n = vec_n / 5)
system.time({
  lenl <- lengths(l)
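  # save the element names separately (the name behavior of stack is the reason for UPDATE 1)
  vec_names <- names(unlist(l))
  # flip a coin for every element, then force one random element per vector to FALSE
  blnKeep <- replace(
    sample(c(FALSE, TRUE), length(vec_names), replace = TRUE),
    ceiling(runif(length(l))*lenl) + c(0, head(cumsum(lenl), -1)),
    FALSE
  )
  # use stack to unlist the list while keeping the originating list index for each value
  temp <- stack(setNames(l, seq_along(l)))[blnKeep,]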
  # re-list
  l_subsetted <- unname(split(setNames(temp$values, vec_names[blnKeep]), temp$ind))
})
#>    user  system elapsed 
#>  22.999   0.936  23.934
head(l_subsetted)
#> [[1]]
#>    HmPsw    nZmjN 
#>        1 12000001 
#> 
#> [[2]]
#>   2WtYS   6YjId 
#>       2 6000002 
#> 
#> [[3]]
#>   xSgZ6   jM5Uw   ujPOc 
#>       3 3000003 6000003 
#> 
#> [[4]]
#>   tF2Kx   r4ZCI 
#>       4 3000004 
#> 
#> [[5]]
#>    FYERK    NQfsF 
#>  6000005 12000005 
#> 
#> [[6]]
#>   gsY1u 
#> 3000006
Created on 2021-11-01 by the reprex package (v2.0.0)
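
To make the offset arithmetic in the code above easier to follow, here is a minimal sketch of the same mask idea on a tiny list. The names toy, offsets, group, drop_pos and keep are made up for this illustration, and the sketch re-lists with split and an explicit factor instead of stack, so treat it as an approximation of the approach rather than the code that was timed.

```r
set.seed(1)

# Toy list of three named vectors (hypothetical data, far smaller than l)
toy <- list(c(a = 1, b = 2, c = 3), c(d = 4, e = 5), c(f = 6, g = 7, h = 8, i = 9))

lens    <- lengths(toy)                    # length of each vector
offsets <- c(0, head(cumsum(lens), -1))    # where each vector starts in the flat layout
flat    <- unlist(toy)                     # one long named vector
group   <- rep.int(seq_along(toy), lens)   # originating list index of every flat element

# One guaranteed removal per vector: a random position inside each vector,
# shifted by that vector's offset in the flat layout
drop_pos <- ceiling(runif(length(toy)) * lens) + offsets

# Coin flip for every element, then force the guaranteed removals to FALSE
keep <- sample(c(FALSE, TRUE), length(flat), replace = TRUE)
keep[drop_pos] <- FALSE

# Re-list; explicit factor levels keep vectors that ended up empty
toy_subsetted <- unname(split(flat[keep], factor(group[keep], levels = seq_along(toy))))
toy_subsetted
```

Every vector loses at least one element, and the result still has one entry per original vector, including empty ones.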

UPDATE 2 for vectors of uniformly distributed lengths:

@runr is correct in the comments that the above code will result in binomially-distributed vector lengths, while the OP's original code results in uniformly-distributed vector lengths. Below is an example of how to use the same idea to get uniformly-distributed vector lengths. The code is more complex, but the run-time seems to be a bit faster (possibly due to circumventing stack):

library(stringi)
set.seed(123)
vec_n <- 15e6
vec_vals  <- 1:vec_n
vec_names <- stringi::stri_rand_strings(vec_n, 5)
my_named_vec <- setNames(vec_vals, vec_names)
split_func <- function(x, n) {
  unname(split(x, rep_len(1:n, length(x))))
}
l <- split_func(my_named_vec, n = vec_n / 5)

system.time({
  idx <- seq_along(l)
  lenl <- lengths(l)
  ul <- unlist(l)
  # get a random number of elements to remove from each vector
  nRemove <- ceiling(runif(length(l))*lenl)
  blnNotEmpty <- nRemove != lenl # will the subset vector have any elements?
  
  # loop until the predetermined number of elements have been removed from each vector
  while (length(nRemove)) {
    # remove a random element from vectors that have too many
    ul <- ul[-(ceiling(runif(length(idx))*lenl[idx]) + c(0, head(cumsum(lenl), -1))[idx])]
    lenl[idx] <- lenl[idx] - 1L # decrement the vector lengths
    blnKeep <- nRemove != 1
    idx <- idx[blnKeep]
    nRemove <- nRemove[blnKeep] - 1L # decrement the number of elements left to remove
  }
  
  l_subsetted <- rep(list(integer(0)), length(l))
  l_subsetted[blnNotEmpty] <- unname(split(ul, rep.int(seq_along(l), lenl)))
})
#>    user  system elapsed 
#>  18.396   0.935  19.332
head(l_subsetted)
#> [[1]]
#>   Qk8NP   Quo3T   8f0GH 
#> 3000001 6000001 9000001 
#> 
#> [[2]]
#> integer(0)
#> 
#> [[3]]
#>    xSgZ6    ujPOc    CTV5F    5JRT5 
#>        3  6000003  9000003 12000003 
#> 
#> [[4]]
#>   tF2Kx   Ooklo   VOLHU 
#>       4 6000004 9000004 
#> 
#> [[5]]
#>    tgdze    w8d1B    jlClo    NQfsF 
#>        5  3000005  9000005 12000005 
#> 
#> [[6]]
#>    gsY1u    CjBwC    Oqqty    dxJ4c 
#>  3000006  6000006  9000006 12000006
# check that vector lengths are uniformly-distributed (lengths of 0-4 are equally likely)
table(lengths(l_subsetted))
#> 
#>      0      1      2      3      4 
#> 599633 599041 601209 600648 599469
Created on 2021-11-02 by the reprex package (v2.0.1)
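
The difference between the two length distributions can be seen without touching the data at all. The sketch below only simulates the resulting lengths for vectors of length 5; n, k, len_uniform and len_binomial are names invented for this illustration, and n is deliberately much smaller than the 3 million vectors above.

```r
set.seed(42)
n <- 1e5   # number of simulated vectors (hypothetical)
k <- 5L    # original length of every vector

# Question's approach: remove a uniformly chosen count between 1 and k
len_uniform <- k - sample.int(k, n, replace = TRUE)

# Coin-flip approach (UPDATE 1): one forced removal, then each of the
# other k - 1 elements is kept independently with probability 1/2
len_binomial <- rbinom(n, size = k - 1L, prob = 0.5)

prop_uniform  <- prop.table(table(factor(len_uniform,  levels = 0:(k - 1L))))
prop_binomial <- prop.table(table(factor(len_binomial, levels = 0:(k - 1L))))

# Rows: approach; columns: resulting vector length 0..4
round(rbind(uniform = prop_uniform, binomial = prop_binomial), 3)
```

In theory the uniform row is 0.2 in every column, while the binomial row follows 1/16, 4/16, 6/16, 4/16, 1/16, so lengths of 0 and 4 are much rarer under the coin-flip approach.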

answered Oct 28 '22 04:10

jblood94

Very rough and I'm not particularly proud of this. I'm sure there is a more elegant way, but this ran in a matter of seconds on my machine.

> # Make some fake data
> out <- lapply(1:3000000, function(i){sample(LETTERS, 5, replace = FALSE)})
> out[1:5]
[[1]]
[1] "D" "H" "C" "Y" "V"

[[2]]
[1] "M" "E" "H" "G" "S"

[[3]]
[1] "R" "P" "O" "L" "M"

[[4]]
[1] "C" "U" "G" "Q" "X"

[[5]]
[1] "Q" "L" "W" "O" "V"

> # Create list with ids to sample
> id <- lapply(1:3000000, function(i){sample(1:5, sample(1:5, 1), replace = FALSE)})
> id[1:5]
[[1]]
[1] 2

[[2]]
[1] 2 3 4 1 5

[[3]]
[1] 4

[[4]]
[1] 5

[[5]]
[1] 1 2

> # Extract the ids from the original data using the id list.
> # Like I said I'm not particularly proud of this but it gets the job
> # done quick enough on my computer
> out <- lapply(1:3000000, function(i){out[[i]][id[[i]]]})
> out[1:5]
[[1]]
[1] "H"

[[2]]
[1] "E" "H" "G" "M" "S"

[[3]]
[1] "L"

[[4]]
[1] "X"

[[5]]
[1] "Q" "L"
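
As a side note, the final extraction step pairs each vector with its index vector, which is what Map is for. The sketch below is not part of the transcript above; it uses a smaller list (1000 vectors, an arbitrary choice) and the illustrative names by_index and by_map to check that both formulations give the same result. Whether Map is any faster is not something this sketch measures.

```r
set.seed(1)

# Smaller fake data, same construction as in the transcript above
out <- lapply(1:1000, function(i) sample(LETTERS, 5, replace = FALSE))
id  <- lapply(1:1000, function(i) sample(1:5, sample(1:5, 1), replace = FALSE))

# Index-based extraction, as in the transcript
by_index <- lapply(seq_along(out), function(i) out[[i]][id[[i]]])

# Same extraction, walking both lists in parallel with Map
by_map <- Map(`[`, out, id)

identical(by_index, unname(by_map))
```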

answered Oct 28 '22 02:10

Dason

Simplify the sampling function:

randomly_subset_vec_2 <- function(x) {
  my_range <- length(x)
  x[-sample(my_range, sample(my_range, 1))]
}

This alone can give a significant speed-up.
And though I have not tested it, given the problem description, removing some elements (the minus sign before sample) amounts to keeping the others. So why not extract some elements directly (no minus sign) and keep those?

Simpler and faster: sampling directly from x is the fastest so far.

randomly_subset_vec_3 <- function(x) {
  sample(x, sample(length(x), 1))
}
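
One thing worth checking before swapping one variant for the other is that they do not return quite the same kind of result. The sketch below runs both functions on a small list (the name small and the size 1e4 are arbitrary choices for illustration): the removal variant keeps the original order and can return an empty vector, while sampling directly from x returns between 1 and 5 elements in shuffled order.

```r
randomly_subset_vec_2 <- function(x) {
  my_range <- length(x)
  x[-sample(my_range, sample(my_range, 1))]
}

randomly_subset_vec_3 <- function(x) {
  sample(x, sample(length(x), 1))
}

set.seed(1)
# Small test list: 10,000 identical named vectors of length 5 (hypothetical data)
small <- replicate(1e4, setNames(1:5, letters[1:5]), simplify = FALSE)

res_2 <- lapply(small, randomly_subset_vec_2)
res_3 <- lapply(small, randomly_subset_vec_3)

table(lengths(res_2))  # removal variant: lengths 0 to 4
table(lengths(res_3))  # direct sampling: lengths 1 to 5

# The removal variant preserves the original element order
all(vapply(res_2, function(v) all(diff(v) > 0), logical(1)))
```

Also note that sample(x, ...) treats a single number x as 1:x, so the direct-sampling variant is only safe while every vector has at least two elements, as is the case here.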

answered Oct 28 '22 02:10

Rui Barradas

It seems that the largest bottleneck is running all the sample calls, so we could try the following. One option is the solution by Julius Vainora. First, we generate funFast with Rcpp:

library(inline)
library(Rcpp)
src <- 
'
int num = as<int>(size), x = as<int>(n);
Rcpp::NumericVector vx = Rcpp::clone<Rcpp::NumericVector>(x);
Rcpp::NumericVector pr = Rcpp::clone<Rcpp::NumericVector>(prob);
Rcpp::NumericVector rnd = rexp(x) / pr;
for(int i= 0; i<vx.size(); ++i) vx[i] = i;
std::partial_sort(vx.begin(), vx.begin() + num, vx.end(), Comp(rnd));
vx = vx[seq(0, num - 1)] + 1;
return vx;
'
incl <- 
'
struct Comp{
  Comp(const Rcpp::NumericVector& v ) : _v(v) {}
  bool operator ()(int a, int b) { return _v[a] < _v[b]; }
  const Rcpp::NumericVector& _v;
};
'
funFast <- cxxfunction(signature(n = "Numeric", size = "integer", prob = "numeric"),
                       src, plugin = "Rcpp", include = incl)

Then, define an alternative to your randomly_subset_vec using funFast instead of sample:

'randomly_subset_vec_2' <- function(x) {
  range <- length(x)
  probs <- rep(1/range, range)
  
  o <- funFast(range, size = funFast(range, size = 1, prob = probs), prob = probs)
  return(x[-o])
}

tic();obj <- rrapply(object = l, f = randomly_subset_vec_2);toc();
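
funFast selects elements by sorting random keys. The same idea can, in principle, be applied to all vectors at once in plain R, which avoids one function call per vector. The following is only a sketch of that variation, not the Rcpp approach above: the names toy, key, rank_in_group and result are invented here, it uses uniform rather than exponential keys (equivalent for equal probabilities), and it has not been timed on the full 3 million vectors.

```r
set.seed(1)
# Small stand-in for l: 1000 named vectors of length 5 (hypothetical data)
toy <- replicate(1000, setNames(1:5, letters[1:5]), simplify = FALSE)

lens  <- lengths(toy)
flat  <- unlist(toy)
group <- rep.int(seq_along(toy), lens)   # list index of every flat element

# Number of elements to remove from each vector: uniform on 1..length
n_remove <- ceiling(runif(length(toy)) * lens)

# One random key per element; a single global ordering by (group, key)
# yields a random rank for every element within its own vector
key <- runif(length(flat))
ord <- order(group, key)
rank_in_group <- integer(length(flat))
rank_in_group[ord] <- sequence(lens)

# Drop the n_remove lowest-ranked elements of each vector
keep <- rank_in_group > n_remove[group]

# Re-list; explicit factor levels keep vectors that ended up empty
result <- unname(split(flat[keep], factor(group[keep], levels = seq_along(toy))))

table(lengths(result))
```

Because only a logical mask is applied to flat, the surviving elements stay in their original order within each vector.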

answered Oct 28 '22 04:10

runr
