Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

In R, why is sum so slow compared to others, such as cumsum?

Tags:

performance

r

I am attempting to implement a function that needs to be incredibly fast, mostly because it is working on huge data frames many times over and over.

R has always left me befuddled as to why it is sometimes a little slow, and why it is at other times ridiculously slow. (It is unfortunately never fast.)

Regardless, I have always assumed that, when possible, things could run much faster when pushed into an apply, sapply, or lapply somehow, instead of put into a loop. I recently ran across an example that makes me think there is something more going on under the hood that, if I were to understand it, might help me greatly with future optimizations.

Here are some calculations that I ran on my relatively powerful Ubuntu Linux machine:

system.time(sapply(1:1e5, sum))
user  system elapsed
35.130   0.000  35.128


system.time(sapply(1:1e5, cumsum))
user  system elapsed
0.110   0.000   0.108

Yes, you are reading those numbers correctly: cumsum, which creates an array of the cumulative sum, is orders of magnitude faster than just providing me the simple sum. (It'd be great if someone else could verify these results on their machine!)

I don't see how this is possible, except if the implementations differ drastically. Assuming they do differ drastically, I'd like to know in what fashion, so that I could look out for certain functions to avoid when looking for speed. (For core functions, I do not know how to see their source. The standard means of just typing in the function name without any parentheses does not work for core functions.)

Thanks so much!

like image 470
Mike Williamson Avatar asked May 08 '14 00:05

Mike Williamson


People also ask

What does sum () do in R?

The sum() function in R to find the sum of the values in the vector. This tutorial shows how to find the sum of the values, the sum of a particular row and column, and also how to get the summation value of each row and column in the dataset. The important thing is to consider the NA values or not.


2 Answers

Following more or less the instructions for using operf I created a file with the single line sapply(1:1e5, sum) and ran

$ operf ~/bin/R-3-1-branch/bin/R -f sum.R
$ opreport -l ~/bin/R-3-1-branch/lib/libR.so |less

producing

CPU: Intel Sandy Bridge microarchitecture, speed 2.401e+06 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (No unit mask) count 100000
samples  %        image name               symbol name
835882   93.0929  libR.so                  RunGenCollect
27731     3.0884  libR.so                  SortNodes
9323      1.0383  libR.so                  AgeNodeAndChildren
2038      0.2270  libR.so                  CheckFinalizers
1593      0.1774  libR.so                  Rf_allocVector3
1222      0.1361  libR.so                  duplicate1
...

etc. Most of the time is spent in the garbage collector (RunGenCollect -- run the generational garbage collector). So I ran

$ R -d gdb R
(gdb) run
> sapply(1:1e5, sum)
^C
(gdb) break RunGenCollect
(gdb) continue
Continuing.

Breakpoint 1, RunGenCollect (size_needed=50000) at /home/mtmorgan/src/R-3-1-branch/src/main/memory.c:1504
1504        bad_sexp_type_seen = 0;
(gdb) where

which produced

#0  RunGenCollect (size_needed=50000) at /home/mtmorgan/src/R-3-1-branch/src/main/memory.c:1504
#1  0x00007ffff789d354 in R_gc_internal (size_needed=50000) at /home/mtmorgan/src/R-3-1-branch/src/main/memory.c:2825
#2  0x00007ffff789e99b in Rf_allocVector3 (type=13, length=100000, allocator=0x0) at /home/mtmorgan/src/R-3-1-branch/src/main/memory.c:2563
#3  0x00007ffff788e1a5 in Rf_allocVector (type=13, length=100000) at /home/mtmorgan/src/R-3-1-branch/src/include/Rinlinedfuns.h:189
#4  0x00007ffff7831787 in duplicate1 (s=0x7ffff3b0b010, deep=TRUE) at /home/mtmorgan/src/R-3-1-branch/src/main/duplicate.c:335
#5  0x00007ffff783371a in duplicate_child (s=0x7ffff3b0b010, deep=TRUE) at /home/mtmorgan/src/R-3-1-branch/src/main/duplicate.c:199
#6  0x00007ffff783357a in duplicate_list (s=0x2c98b30, deep=TRUE) at /home/mtmorgan/src/R-3-1-branch/src/main/duplicate.c:261
#7  0x00007ffff7830fc2 in duplicate1 (s=0x2c98b30, deep=TRUE) at /home/mtmorgan/src/R-3-1-branch/src/main/duplicate.c:308
#8  0x00007ffff783371a in duplicate_child (s=0x2c98b30, deep=TRUE) at /home/mtmorgan/src/R-3-1-branch/src/main/duplicate.c:199
#9  0x00007ffff783357a in duplicate_list (s=0x2c98a88, deep=TRUE) at /home/mtmorgan/src/R-3-1-branch/src/main/duplicate.c:261
#10 0x00007ffff7830fc2 in duplicate1 (s=0x2c98a88, deep=TRUE) at /home/mtmorgan/src/R-3-1-branch/src/main/duplicate.c:308
#11 0x00007ffff7830c7f in Rf_duplicate (s=0x2c98a88) at /home/mtmorgan/src/R-3-1-branch/src/main/duplicate.c:132
#12 0x00007ffff79257f4 in do_summary (call=0x2c98a88, op=0x6259a0, args=0x303cf88, env=0x2c97f48) at /home/mtmorgan/src/R-3-1-branch/src/main/summary.c:462
...

and the relevant line here is line 462

(gdb) up 12
#12 0x00007ffff79257f4 in do_summary (call=0x2c98a88, op=0x6259a0, args=0x303cf88, env=0x2c97f48) at /home/mtmorgan/src/R-3-1-branch/src/main/summary.c:462
462     PROTECT(call2 = duplicate(call));
(gdb) list
457     return ans;
458     }
459 
460     /* match to foo(..., na.rm=FALSE) */
461     PROTECT(args = fixup_NaRm(args));
462     PROTECT(call2 = duplicate(call));
463     SETCDR(call2, args);
464 
465     if (DispatchGroup("Summary", call2, op, args, env, &ans)) {
466     UNPROTECT(2);

The call is being duplicated

(gdb) call Rf_PrintValue(call)
FUN(1:100000[[5339L]], ...)

for each iteration of the loop, triggering garbage collection. Similar code is not executed for cumsum. It has been this way for a long time and for reasons that are not 100% apparent

$ svn annotate ~/src/R-3-1-branch/src/main/summary.c |less
...
 42643     ripley     /* match to foo(..., na.rm=FALSE) */
 42643     ripley     PROTECT(args = fixup_NaRm(args));
 42643     ripley     PROTECT(call2 = duplicate(call));
 42643     ripley     SETCDR(call2, args)
...
$ svn log -r42643
------------------------------------------------------------------------
r42643 | ripley | 2007-08-25 23:09:50 -0700 (Sat, 25 Aug 2007) | 1 line

make the rest of the group generics primitive
------------------------------------------------------------------------

It would be interesting to take this up on the R-devel mailing list. It is not that sum is particularly slow, but rather that the calls to the garbage collector come to dominate execution time.

Hmm, on reflection it turns out that

sapply(1:1e5, function(x) sum(x))

is running in the same ballpark as cumsum. I think it's because the duplicate at line 462 in the original version is making a copy of the 1e5 elements in preparation for selecting the ith element to sum. In contrast, in function(x) sum(x) the vector has already been subset, so the duplication is only of the ith element. Duplicating the original vector also explains why 1e5 elements is so much slower than 1e4 elements, and why as.list(1:1e5) is relatively performant (only the list element is actually duplicated, or perhaps not even that). Duplication during the call to sum has something to do with the fact that it belongs to the (S3) Summary group generic, see ?"group generic".

like image 192
Martin Morgan Avatar answered Oct 08 '22 09:10

Martin Morgan


Having just joined this thing, apparently I do not have enough reputation to comment on Martin's post, but I submitted a patch here to resolve the issue. Actually two patches. The first avoids duplication in almost every case. The second, much simpler patch just shallow duplicates, so that the pairlist is duplicated but not the big vector in the call. Another way to fix it would be to do the equivalent of lapply(1:1e5, function(x) sum(x)), i.e., only have the symbol in the call. However, that would interfere with do_lapply's attempt at optimization by skipping some evaluation with each iteration.

Update: the second patch has been applied to R-devel.

like image 24
Michael Lawrence Avatar answered Oct 08 '22 08:10

Michael Lawrence