Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to substitute a for-loop with vecorization acting several thousand times per data.frame row?

Being still quite wet behind the ears concerning R and - more important - vectorization, I cannot get my head around how to speed up the code below.

The for-loop calculates a number of seeds falling onto a road for several road segments with different densities of seed-generating plants by applying a random propability for every seed. As my real data frame has ~200k rows and seed numbers are up to 300k/segment, using the example below would take several hours on my current machine.

#Example data.frame
df <- data.frame(Density=c(0,0,0,3,0,120,300,120,0,0))
#Example SeedRain vector
SeedRainDists <- c(7.72,-43.11,16.80,-9.04,1.22,0.70,16.48,75.06,42.64,-5.50)

#Calculating the number of seeds from plant densities
df$Seeds <- df$Density * 500

#Applying a probability of reaching the road for every seed
df$SeedsOnRoad <- apply(as.matrix(df$Seeds),1,function(x){
    SeedsOut <- 0
    if(x>0){
        #Summing up the number of seeds reaching a certain distance
        for(i in 1:x){
            SeedsOut <- SeedsOut +
                ifelse(sample(SeedRainDists,1,replace=T)>40,1,0)
        }
    }
    return(SeedsOut)
})

If someone might give me a hint as to how the loop could be substituted by vectorization - or maybe how the data could be organized better in the first place to improve performance - I would be very grateful!

Edit: Roland's answer showed that I may have oversimplified the question. In the for-loop I extract a random value from a distribution of distances recorded by another author (that's why I can't supply the data here). Added an exemplary vector with likely values for SeedRain distances.

like image 584
sir_husefugg Avatar asked Jan 15 '23 00:01

sir_husefugg


2 Answers

This should do about the same simulation:

df$SeedsOnRoad2 <- sapply(df$Seeds,function(x){
  rbinom(1,x,0.6)
})



#   Density  Seeds SeedsOnRoad SeedsOnRoad2
#1        0      0           0            0
#2        0      0           0            0
#3        0      0           0            0
#4        3   1500         892          877
#5        0      0           0            0
#6      120  60000       36048        36158
#7      300 150000       90031        89875
#8      120  60000       35985        35773
#9        0      0           0            0
#10       0      0           0            0
like image 116
Roland Avatar answered Jan 19 '23 11:01

Roland


One option is generate the sample() for all Seeds per row of df in a single go.

Using set.seed(1) before your loop-based code I get:

> df
   Density  Seeds SeedsOnRoad
1        0      0           0
2        0      0           0
3        0      0           0
4        3   1500         289
5        0      0           0
6      120  60000       12044
7      300 150000       29984
8      120  60000       12079
9        0      0           0
10       0      0           0

I get the same answer in a fraction of the time if I do:

set.seed(1)
tmp <- sapply(df$Seeds, 
              function(x) sum(sample(SeedRainDists, x, replace = TRUE) > 40)))

> tmp
 [1]     0     0     0   289     0 12044 29984 12079     0     0

For comparison:

df <- transform(df, GavSeedsOnRoad = tmp)
df

> df
   Density  Seeds SeedsOnRoad GavSeedsOnRoad
1        0      0           0              0
2        0      0           0              0
3        0      0           0              0
4        3   1500         289            289
5        0      0           0              0
6      120  60000       12044          12044
7      300 150000       29984          29984
8      120  60000       12079          12079
9        0      0           0              0
10       0      0           0              0

The points to note here are:

  1. try to avoid calling a function repeatedly in a loop if you the function is vectorised or can generate the entire end result with a single call. Here you were calling sample() Seeds times for each row of df, each call returning a single sample from SeedRainDists. Here I do a single sample() call asking for sample size Seeds, for each row of df - hence I call sample 10 times, your code called it 271500 times.
  2. even if you have to repeatedly call a function in a loop, remove from the loop anything that is vectorised that could be done on the entire result after the loop is done. An example here is your accumulating of SeedsOut, which is calling +() a large number of times.

    Better would have been to collect each SeedsOut in a vector, and then sum() that vector outside the loop. E.g.

    SeedsOut <- numeric(length = x)
    for(i in seq_len(x)) {
      SeedsOut[i] <- ifelse(sample(SeedRainDists,1,replace=TRUE)>40,1,0)
    }
    sum(SeedOut)
    
  3. Note that R treats a logical as if it were numeric 0s or 1s where used in any mathematical function. Hence

    sum(ifelse(sample(SeedRainDists, 100, replace=TRUE)>40,1,0))
    

    and

    sum(sample(SeedRainDists, 100, replace=TRUE)>40)
    

    would give the same result if run with the same set.seed().

There may be a fancier way of doing the sampling requiring fewer calls to sample() (and there is, sample(SeedRainDists, sum(Seeds), replace = TRUE) > 40 but then you need to take care of selecting the right elements of that vector for each row of df - not hard, just a light cumbersome), but what i show may be efficient enough?

like image 42
Gavin Simpson Avatar answered Jan 19 '23 10:01

Gavin Simpson