
Optimizing functions for lists to avoid looping in R

Tags: performance, r

I am working with a large list of values in R and need to apply some functions to each element of the list. The list is i1, produced by the following code:

i1=list(0)
i1[1:120000]=runif(120000,min = 10000,max = 100000)

I have to apply some functions to i1 in order to get a new data frame, using each value in the list as input. The first function, f_1, computes a new value from each value in i1 using nested conditionals:

f_1=function(x)
{
  y=ifelse((x/18)>20,x-(x/18),ifelse(x>20,x-20,ifelse(x==0,0,x)))
  return(y)
}

The second function is f_2. It calls f_1 inside a for loop with 160 iterations. An empty vector is created first, and it then grows as f_1 is applied to the previous element on each iteration. The final result of f_2 is a data frame built from all the elements produced in the loop:

f_2=function(v)
{
  x=c()
  y=v
  x[1]=y
  for(i in 2:160)
  {
    x[i]=f_1(x[i-1])
  }
  x=x[!duplicated(x)]
  x=c(x,0)
  z=as.data.frame(t(abs(diff(x))))
  return(z)
}

Finally, to apply both f_1 and f_2 to i1 I use the plyr package. I built this wrapper function for that:

compute=function(x)
{
  y=f_2(x)
  return(y)
}

By using compute I can apply the functions for all elements in the list. I use this code for that:

L2=llply(i1,compute)

Everything works, but it takes a long time to produce the final result:

system.time(llply(i1,compute))
   user  system elapsed 
 436.71    0.92  447.70 

I think the process is slow because f_2 uses a loop internally. I have looked for ways to avoid the loop, but I do not have a clear idea how to change f_2 to make it more efficient. Could you give me some directions for solving this? I know how to write functions, but in this case I used a for loop inside the function to get the result I wanted.

Thanks for your help!

asked Mar 10 '26 05:03 by Duck


2 Answers

There are several problems with your code. For example you make the classic mistake of growing objects in a loop.
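To illustrate what "growing an object in a loop" means, here is a minimal sketch. The function names grow_vec and prealloc_vec are made up for this example and do not appear in the question; both compute the same values, but the second allocates the result vector once at full length.

```r
# Illustrative only: grow_vec and prealloc_vec are hypothetical names.
grow_vec <- function(n) {
  x <- c()                       # starts empty
  for (i in seq_len(n)) {
    x[i] <- sqrt(i)              # x is extended by one element per iteration
  }
  x
}

prealloc_vec <- function(n) {
  x <- numeric(n)                # allocated once at full length
  for (i in seq_len(n)) {
    x[i] <- sqrt(i)              # filled in place
  }
  x
}

same_result <- identical(grow_vec(1000), prealloc_vec(1000))
```

In f_2 the equivalent change would presumably be to start from x = numeric(160) instead of x = c().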

However, if you are unhappy with performance of your code, you should start profiling it:

Rprof()
L2=llply(i1,compute)
Rprof(NULL)
summaryRprof()$by.self
#                       self.time self.pct total.time total.pct
#"ifelse"                    3.38    35.58       4.06     42.74
#"f_2"                       2.28    24.00       9.48     99.79
#"f_1"                       1.46    15.37       5.52     58.11
#"as.vector"                 0.86     9.05       0.86      9.05
#"as.data.frame.matrix"      0.32     3.37       1.44     15.16
#"paste0"                    0.20     2.11       0.22      2.32
#"is.na"                     0.20     2.11       0.20      2.11
#</snip>

You see that much of the time is spent in ifelse, as.vector and as.data.frame.matrix. It's not quite obvious where as.vector is called[1], but the other two are easy to locate in your code.

You can get slightly better performance using if and else instead of ifelse, but it doesn't help that much. I would instead turn f_1 and the for loop in f_2 into compiled code using Rcpp (it's very simple with RStudio). Obviously you need the tool chain, i.e., install Rtools on Windows.
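For reference, the if/else variant mentioned above might look like the sketch below. The name f_1_scalar is hypothetical, and the function assumes x is a single number, which is how f_2 calls it (one element at a time).

```r
# f_1_scalar is a hypothetical name; it assumes x is a single number.
f_1_scalar <- function(x) {
  if (x / 18 > 20) {
    x - x / 18
  } else if (x > 20) {
    x - 20
  } else {
    x                            # also covers x == 0, which returns 0
  }
}

# Apply it to a few sample values, one at a time.
results <- vapply(c(0, 10, 100, 36000), f_1_scalar, numeric(1))
```

Unlike ifelse, this version is not vectorized, so it only suits the element-by-element loop in f_2.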

#include <Rcpp.h>
using namespace Rcpp;

double f1 (const double x) {
  if((x/18)>20) return x-(x/18); 
  if(x>20) return x-20; 
  if(x==0) return 0; 
  return x; 
}

// [[Rcpp::export]]
NumericVector f2_1 (const double init, const int n){ 
  NumericVector res(n);
  res(0) = init;
  for (int i=1; i<n; i++) res(i) = f1(res(i-1));
  return res;
}

This is quicker to write than coming up with a vectorized pure-R solution (provided one even exists).

We can define the rest of your f_2 as:

f_2a=function(v)
{
  x = f2_1(v, 160)
  x=x[!duplicated(x)]
  x=c(x,0)
  z=abs(diff(x))
  return(z)
}

Note how I leave out t and as.data.frame, because data.frames should be avoided if performance matters. They are designed more for convenience than for performance. A vector can store the same information as a one-row all-numeric data.frame, and I can't imagine a good reason to return a list of one-row data.frames.
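A small sketch of that equivalence, using made-up values: the one-row data.frame can be flattened back to exactly the vector it was built from, while taking more memory.

```r
v <- c(5, 3, 8)                  # arbitrary example values
df <- as.data.frame(t(v))        # one-row data.frame with columns V1, V2, V3

# Flattening the data.frame recovers the original vector.
back <- unlist(df, use.names = FALSE)
same_values <- identical(back, v)

# Ratio of memory footprints (data.frame relative to plain vector).
size_ratio <- as.numeric(object.size(df)) / as.numeric(object.size(v))
```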

Now we call the function:

L2a = lapply(i1, f_2a)

Let's test if results are equal:

all.equal(L2[[1]], as.data.frame(t(L2a[[1]])))
#[1] TRUE

And now compare timings:

system.time(llply(i1,compute))
# user  system elapsed 
#13.91    0.00   13.93 

system.time(lapply(i1, f_2a))
#user  system elapsed 
#0.26    0.00    0.27

[1] It's called in a loop in as.data.frame.matrix splitting the matrix into a list of column vectors.

answered Mar 11 '26 19:03 by Roland


One way to gain performance here is to use vectors/matrices.

So first you could either create your data as a vector like so

i1 = runif(120000,min = 10000,max = 100000)

or converting it into a vector like so

vector1 = unlist(i1)

and when you're done convert it back to a list

list1 = as.list(vector1)
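A quick round-trip check on a small sample (5 values instead of 120000; the name i1_small and the seed are chosen just for this sketch). It also shows why as.list is the call that restores one element per value, whereas list would wrap the whole vector in a single element.

```r
set.seed(1)                      # arbitrary seed, only for reproducibility
i1_small <- as.list(runif(5, min = 10000, max = 100000))

vector1 <- unlist(i1_small)      # list -> numeric vector
list1 <- as.list(vector1)        # vector -> list with one element per value
wrapped <- list(vector1)         # a list of length 1 holding the whole vector

round_trip_ok <- identical(list1, i1_small)
```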

New Functions Structure

With vectors you can make use of logical indexing, and f_1 would look like this:

f_1=function(x)
{
  y = rep(NA,length(x)) #initialize y filled with "NA"
  y[x <= 20] = x[x <= 20] #fall-through case, also covers x == 0
  y[x>20] = x[x>20] - 20
  #assigned last so it overrides the x>20 case, as in the nested ifelse
  y[(x/18)>20] = x[(x/18)>20] - (x[(x/18)>20]/18)
  return(y)
}

This way y will have the calculated value for every x.
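One way to confirm that an indexing-based version matches the nested ifelse from the question is to run both on a few boundary values. This is only a sketch: f_1_nested and f_1_indexed are hypothetical names, and the indexed version assigns the (x/18)>20 case last so that it takes priority over the x>20 case.

```r
# The nested-ifelse logic from the question.
f_1_nested <- function(x) {
  ifelse((x/18) > 20, x - (x/18), ifelse(x > 20, x - 20, ifelse(x == 0, 0, x)))
}

# Logical-indexing version; later assignments override earlier ones.
f_1_indexed <- function(x) {
  y <- rep(NA_real_, length(x))
  y[x <= 20] <- x[x <= 20]                               # fall-through case
  y[x > 20] <- x[x > 20] - 20
  y[(x/18) > 20] <- x[(x/18) > 20] - x[(x/18) > 20]/18   # highest priority
  y
}

x_test <- c(0, 10, 20, 100, 360, 361, 36000)             # boundary values
agree <- isTRUE(all.equal(f_1_nested(x_test), f_1_indexed(x_test)))
```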

Optimization Points:

  • Using vectors/matrices
  • Initializing vectors/matrices beforehand

Other Possibilities

Comparisons used more than once (such as (x/18)>20) can be stored and reused to gain performance. For example, y[(x/18)>20] = x[(x/18)>20] - (x[(x/18)>20]/18) would turn into:

condition1 = (x/18)>20;
y[condition1] = x[condition1] - (x[condition1]/18)

This way the condition will be calculated only once instead of three times.
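Extending the same idea to the whole function could look like the sketch below. The names f_1_cached, cond_big and cond_mid are hypothetical, and as an assumption of this sketch y starts as a copy of x (the default branch) rather than NA. Because the two masks are mutually exclusive, the order of the assignments does not matter here.

```r
# Hypothetical sketch: each condition is computed once and reused.
f_1_cached <- function(x) {
  cond_big <- (x / 18) > 20          # first branch of the nested ifelse
  cond_mid <- x > 20 & !cond_big     # second branch, excluding the first
  y <- x                             # default: value is returned unchanged
  y[cond_mid] <- x[cond_mid] - 20
  y[cond_big] <- x[cond_big] - x[cond_big] / 18
  y
}

cached_out <- f_1_cached(c(0, 10, 100, 36000))
```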

Final Notes

  • Be aware that if you do not cover all the possible cases you may end up with NAs in your vector.
  • You could also initialize it with some other value, such as a default, if you have one. Just change NA to whatever value you want to fill it with, e.g. y = rep(NA,length(x)) -> y = rep(0,length(x)) (filling with zeros).
  • Note that <- is the more common assignment operator in R; I just used = to avoid misunderstandings.
answered Mar 11 '26 19:03 by Gui Brunow