I am attempting to implement a function that needs to be incredibly fast, mostly because it is applied to huge data frames over and over.
R has always left me befuddled as to why it is sometimes a little slow, and why it is at other times ridiculously slow. (It is unfortunately never fast.)
Regardless, I have always assumed that, when possible, things could run much faster when pushed into an apply, sapply, or lapply rather than written as an explicit loop. I recently ran across an example that makes me think there is something more going on under the hood which, if I understood it, might help me greatly with future optimizations.
Here are some calculations that I ran on my relatively powerful Ubuntu Linux machine:
system.time(sapply(1:1e5, sum))
user system elapsed
35.130 0.000 35.128
system.time(sapply(1:1e5, cumsum))
user system elapsed
0.110 0.000 0.108
Yes, you are reading those numbers correctly: cumsum, which creates an array of the cumulative sum, is orders of magnitude faster than just providing me the simple sum. (It'd be great if someone else could verify these results on their machine!)
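For anyone verifying, a smaller input keeps the slow case tolerable; a sketch along these lines should be enough (it assumes the microbenchmark package is installed, but repeated system.time calls work just as well):
library(microbenchmark)   # assumes the microbenchmark package is available
microbenchmark(
  sum    = sapply(1:1e4, sum),     # the slow case
  cumsum = sapply(1:1e4, cumsum),  # the fast case
  times  = 10
)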
I don't see how this is possible, unless the implementations differ drastically. Assuming they do, I'd like to know in what way, so that I can look out for certain functions to avoid when speed matters. (I do not know how to see the source of core functions: the standard trick of typing the function name without parentheses does not work for them.)
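To be concrete about that last point, this is all I can get out of R for these two functions (both are primitives, so their bodies live in C rather than in R code I can read):
sum                  # prints: function (..., na.rm = FALSE)  .Primitive("sum")
cumsum               # prints: function (x)  .Primitive("cumsum")
is.primitive(sum)    # TRUE: the implementation is compiled C, not R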
Thanks so much!
Following, more or less, the instructions for using operf, I created a file sum.R containing the single line sapply(1:1e5, sum)
and ran
$ operf ~/bin/R-3-1-branch/bin/R -f sum.R
$ opreport -l ~/bin/R-3-1-branch/lib/libR.so |less
producing
CPU: Intel Sandy Bridge microarchitecture, speed 2.401e+06 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (No unit mask) count 100000
samples % image name symbol name
835882 93.0929 libR.so RunGenCollect
27731 3.0884 libR.so SortNodes
9323 1.0383 libR.so AgeNodeAndChildren
2038 0.2270 libR.so CheckFinalizers
1593 0.1774 libR.so Rf_allocVector3
1222 0.1361 libR.so duplicate1
...
etc. Most of the time is spent in the garbage collector (RunGenCollect
-- run the generational garbage collector). So I ran
$ R -d gdb -f sum.R
(gdb) run
> sapply(1:1e5, sum)
^C
(gdb) break RunGenCollect
(gdb) continue
Continuing.
Breakpoint 1, RunGenCollect (size_needed=50000) at /home/mtmorgan/src/R-3-1-branch/src/main/memory.c:1504
1504 bad_sexp_type_seen = 0;
(gdb) where
which produced
#0 RunGenCollect (size_needed=50000) at /home/mtmorgan/src/R-3-1-branch/src/main/memory.c:1504
#1 0x00007ffff789d354 in R_gc_internal (size_needed=50000) at /home/mtmorgan/src/R-3-1-branch/src/main/memory.c:2825
#2 0x00007ffff789e99b in Rf_allocVector3 (type=13, length=100000, allocator=0x0) at /home/mtmorgan/src/R-3-1-branch/src/main/memory.c:2563
#3 0x00007ffff788e1a5 in Rf_allocVector (type=13, length=100000) at /home/mtmorgan/src/R-3-1-branch/src/include/Rinlinedfuns.h:189
#4 0x00007ffff7831787 in duplicate1 (s=0x7ffff3b0b010, deep=TRUE) at /home/mtmorgan/src/R-3-1-branch/src/main/duplicate.c:335
#5 0x00007ffff783371a in duplicate_child (s=0x7ffff3b0b010, deep=TRUE) at /home/mtmorgan/src/R-3-1-branch/src/main/duplicate.c:199
#6 0x00007ffff783357a in duplicate_list (s=0x2c98b30, deep=TRUE) at /home/mtmorgan/src/R-3-1-branch/src/main/duplicate.c:261
#7 0x00007ffff7830fc2 in duplicate1 (s=0x2c98b30, deep=TRUE) at /home/mtmorgan/src/R-3-1-branch/src/main/duplicate.c:308
#8 0x00007ffff783371a in duplicate_child (s=0x2c98b30, deep=TRUE) at /home/mtmorgan/src/R-3-1-branch/src/main/duplicate.c:199
#9 0x00007ffff783357a in duplicate_list (s=0x2c98a88, deep=TRUE) at /home/mtmorgan/src/R-3-1-branch/src/main/duplicate.c:261
#10 0x00007ffff7830fc2 in duplicate1 (s=0x2c98a88, deep=TRUE) at /home/mtmorgan/src/R-3-1-branch/src/main/duplicate.c:308
#11 0x00007ffff7830c7f in Rf_duplicate (s=0x2c98a88) at /home/mtmorgan/src/R-3-1-branch/src/main/duplicate.c:132
#12 0x00007ffff79257f4 in do_summary (call=0x2c98a88, op=0x6259a0, args=0x303cf88, env=0x2c97f48) at /home/mtmorgan/src/R-3-1-branch/src/main/summary.c:462
...
and the relevant line here is line 462
(gdb) up 12
#12 0x00007ffff79257f4 in do_summary (call=0x2c98a88, op=0x6259a0, args=0x303cf88, env=0x2c97f48) at /home/mtmorgan/src/R-3-1-branch/src/main/summary.c:462
462 PROTECT(call2 = duplicate(call));
(gdb) list
457 return ans;
458 }
459
460 /* match to foo(..., na.rm=FALSE) */
461 PROTECT(args = fixup_NaRm(args));
462 PROTECT(call2 = duplicate(call));
463 SETCDR(call2, args);
464
465 if (DispatchGroup("Summary", call2, op, args, env, &ans)) {
466 UNPROTECT(2);
The call is being duplicated
(gdb) call Rf_PrintValue(call)
FUN(1:100000[[5339L]], ...)
for each iteration of the loop, triggering garbage collection. Similar code is not executed for cumsum. It has been this way for a long time, and for reasons that are not 100% apparent:
$ svn annotate ~/src/R-3-1-branch/src/main/summary.c |less
...
42643 ripley /* match to foo(..., na.rm=FALSE) */
42643 ripley PROTECT(args = fixup_NaRm(args));
42643 ripley PROTECT(call2 = duplicate(call));
42643 ripley SETCDR(call2, args);
...
$ svn log -r42643
------------------------------------------------------------------------
r42643 | ripley | 2007-08-25 23:09:50 -0700 (Sat, 25 Aug 2007) | 1 line
make the rest of the group generics primitive
------------------------------------------------------------------------
It would be interesting to take this up on the R-devel mailing list. It is not that sum is particularly slow, but rather that the calls to the garbage collector come to dominate execution time.
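For anyone without oprofile set up, R's built-in sampling profiler tells a similar story; a rough sketch (gc.profiling = TRUE marks time spent in the collector as <GC> in the summary; the file name sum.prof is arbitrary):
Rprof("sum.prof", gc.profiling = TRUE)   # sample the call stack, flagging GC time
invisible(sapply(1:1e5, sum))
Rprof(NULL)                              # stop profiling
head(summaryRprof("sum.prof")$by.self)   # expect <GC> near the top
The percentages will not match operf's exactly, but the collector should still dominate.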
Hmm, on reflection it turns out that sapply(1:1e5, function(x) sum(x)) is running in the same ballpark as cumsum. I think it's because the duplicate at line 462 in the original version is making a copy of the 1e5 elements in preparation for selecting the ith element to sum. In contrast, in function(x) sum(x) the vector has already been subset, so the duplication is only of the ith element. Duplicating the original vector also explains why 1e5 elements is so much slower than 1e4 elements, and why as.list(1:1e5) is relatively performant (only the list element is actually duplicated, or perhaps not even that). Duplication during the call to sum has something to do with the fact that it belongs to the (S3) Summary group generic; see ?"group generic".
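Putting the pieces together, a rough side-by-side comparison (timings will vary with machine and R version; on an R of this vintage the first line is the slow one):
system.time(sapply(1:1e5, sum))                  # the call carries the full vector, duplicated every iteration
system.time(sapply(1:1e5, function(x) sum(x)))   # only the already-subset element ends up in the call
system.time(sapply(1:1e5, cumsum))               # cumsum's C code does not duplicate the call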
Having just joined this site, I apparently do not have enough reputation to comment on Martin's post, but I submitted a patch here to resolve the issue. Actually, two patches: the first avoids duplication in almost every case; the second, much simpler patch just shallow-duplicates, so that the pairlist is duplicated but not the big vector inside the call. Another way to fix it would be to do the equivalent of lapply(1:1e5, function(x) sum(x)), i.e., only have the symbol in the call. However, that would interfere with do_lapply's attempt to optimize by skipping some evaluation on each iteration.
Update: the second patch has been applied to R-devel.
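On an R that includes that change, the comparison from the question should no longer show an orders-of-magnitude gap; re-running it is a quick check:
system.time(sapply(1:1e5, sum))     # should now be in the same ballpark...
system.time(sapply(1:1e5, cumsum))  # ...as this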