Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Studying the source code of primitive and internal R functions: How is R connected with C?

Tags:

r

Ben Bolkers' answer to this question and the article by Uwe Ligges are already very useful when I try to "decode" a primitive or internal R function. But how is a primitive R function connected with its corresponding C function? I guess that somehow .Primitive must provide this missing link. Take for example is.na:

> is.na
function (x)  .Primitive("is.na")

FUNTAB R_FunTab[] in file "names.c" contains

{"is.na",   do_isna,    0,  1,  1,  {PP_FUNCALL, PREC_FN,   0}},

which means that is.na uses the C function do_isna. do_isna is defined in file "coerce.c":

SEXP attribute_hidden do_isna(SEXP call, SEXP op, SEXP args, SEXP rho)
{
    SEXP ans, dims, names, x;
    R_xlen_t i, n;

    checkArity(op, args);
    check1arg(args, call, "x");

    if (DispatchOrEval(call, op, "is.na", args, rho, &ans, 1, 1))
    return(ans);
    PROTECT(args = ans);
#ifdef stringent_is
    if (!isList(CAR(args)) && !isVector(CAR(args)))
    errorcall_return(call, "is.na " R_MSG_list_vec);

#endif
    x = CAR(args);
    n = xlength(x);
    PROTECT(ans = allocVector(LGLSXP, n));
    if (isVector(x)) {
    PROTECT(dims = getAttrib(x, R_DimSymbol));
    if (isArray(x))
        PROTECT(names = getAttrib(x, R_DimNamesSymbol));
    else
        PROTECT(names = getAttrib(x, R_NamesSymbol));
    }
    else dims = names = R_NilValue;
    switch (TYPEOF(x)) {
    case LGLSXP:
       for (i = 0; i < n; i++)
        LOGICAL(ans)[i] = (LOGICAL(x)[i] == NA_LOGICAL);
    break;
    case INTSXP:
    for (i = 0; i < n; i++)
        LOGICAL(ans)[i] = (INTEGER(x)[i] == NA_INTEGER);
    break;
    case REALSXP:
    for (i = 0; i < n; i++)
        LOGICAL(ans)[i] = ISNAN(REAL(x)[i]);
    break;
    case CPLXSXP:
    for (i = 0; i < n; i++)
        LOGICAL(ans)[i] = (ISNAN(COMPLEX(x)[i].r) ||
                   ISNAN(COMPLEX(x)[i].i));
    break;
    case STRSXP:
    for (i = 0; i < n; i++)
        LOGICAL(ans)[i] = (STRING_ELT(x, i) == NA_STRING);
    break;

/* Same code for LISTSXP and VECSXP : */
#define LIST_VEC_NA(s)                          \
    if (!isVector(s) || length(s) != 1)             \
        LOGICAL(ans)[i] = 0;                    \
    else {                              \
        switch (TYPEOF(s)) {                    \
        case LGLSXP:                        \
        case INTSXP:                        \
            LOGICAL(ans)[i] = (INTEGER(s)[0] == NA_INTEGER);    \
            break;                      \
        case REALSXP:                       \
            LOGICAL(ans)[i] = ISNAN(REAL(s)[0]);        \
            break;                      \
        case STRSXP:                        \
            LOGICAL(ans)[i] = (STRING_ELT(s, 0) == NA_STRING);  \
            break;                      \
        case CPLXSXP:                       \
            LOGICAL(ans)[i] = (ISNAN(COMPLEX(s)[0].r)       \
                       || ISNAN(COMPLEX(s)[0].i));  \
            break;                      \
        default:                        \
            LOGICAL(ans)[i] = 0;                \
        }                           \
    }

    case LISTSXP:
    for (i = 0; i < n; i++) {
        LIST_VEC_NA(CAR(x));
        x = CDR(x);
    }
    break;
    case VECSXP:
    for (i = 0; i < n; i++) {
        SEXP s = VECTOR_ELT(x, i);
        LIST_VEC_NA(s);
    }
    break;
    case RAWSXP:
    /* no such thing as a raw NA */
    for (i = 0; i < n; i++)
        LOGICAL(ans)[i] = 0;
    break;
    default:
    warningcall(call, _("%s() applied to non-(list or vector) of type '%s'"),
            "is.na", type2char(TYPEOF(x)));
    for (i = 0; i < n; i++)
        LOGICAL(ans)[i] = 0;
    }
    if (dims != R_NilValue)
    setAttrib(ans, R_DimSymbol, dims);
    if (names != R_NilValue) {
    if (isArray(x))
        setAttrib(ans, R_DimNamesSymbol, names);
    else
        setAttrib(ans, R_NamesSymbol, names);
    }
    if (isVector(x))
    UNPROTECT(2);
    UNPROTECT(1);
    UNPROTECT(1); /*ans*/
    return ans;
}

But if we want to evaluate is.na(x=3) for example, how are the arguments call,op,args,rho generated? At least some external information must be used, x=3 is not enough. Moreover, at first glance x=3 is not used at all, which must be wrong of course:

> is.na
function (x)  .Primitive("is.na")

The R Code of .Primitive doesn't give a hint:

> .Primitive
function (name)  .Primitive(".Primitive")

Taking all this into account, it is not surprising that an apparently excellent copy isNA of is.na fails:

> isNA <- function (x)  .Primitive("is.na")
> isNA
function (x)  .Primitive("is.na")
> is.na
function (x)  .Primitive("is.na")
> isNA(x=3)
function (x)  .Primitive("is.na")
> is.na(x=3)
[1] FALSE

To put it straight: All of the C functions do_... have these arguments call,op,args,rho. By what formula are they calculated when a primitive R function is called?

like image 468
mra68 Avatar asked Mar 25 '16 13:03

mra68


1 Answers

Great question. I started R under gdb R -d gdb, set a breakpoint at do_isna, then continued R and entered is.na(3).

$ R -d gdb
(gdb) run
Starting program: /home/mtmorgan/bin/R-3-3-branch/bin/exec/R --no-save --no-restore --silent
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
> ## break, cntrl-C
Program received signal SIGINT, Interrupt.
0x00007ffff722fd83 in __select_nocancel () at ../sysdeps/unix/syscall-template.S:81
81  ../sysdeps/unix/syscall-template.S: No such file or directory.
(gdb) b do_isna
Breakpoint 1 at 0x7ffff77e0b3b: file /home/mtmorgan/src/R-3-3-branch/src/main/coerce.c, line 1982.
(gdb) continue
Continuing.

> is.na(3)

Breakpoint 1, do_isna (call=0x1838888, op=0x628218, args=0x1838770, rho=0x63f648)
    at /home/mtmorgan/src/R-3-3-branch/src/main/coerce.c:1982
1982        checkArity(op, args);
(gdb)

At the gdb prompt I asked

(gdb) where
#0  do_isna (call=0x1838888, op=0x628218, args=0x1838770, rho=0x63f648) at /home/mtmorgan/src/R-3-3-branch/src/main/coerce.c:1982
#1  0x00007ffff7869170 in Rf_eval (e=0x1838888, rho=0x63f648) at /home/mtmorgan/src/R-3-3-branch/src/main/eval.c:717
#2  0x00007ffff78b36af in Rf_ReplIteration (rho=0x63f648, savestack=0, browselevel=0, state=0x7fffffffcaf0) at /home/mtmorgan/src/R-3-3-branch/src/main/main.c:258
...

Starting from #2, Rf_ReplIteration is the REPL (read-eval-print loop) trying to evalute is.na(3). It's provided with the environment where the function is being called from. By the time it calls Rf_eval() on line 258, it knows the environment and the call

(gdb) call Rf_PrintValue(rho)
<environment: R_GlobalEnv>
(gdb) call Rf_PrintValue(thisExpr)
is.na(3)

By #1 (eval.c:717), R has figured out the values of op and tmp.

(gdb) call Rf_PrintValue(op)
function (x)  .Primitive("is.na")
(gdb) call TYPEOF(op)
$2 = 8

(type 8 is 'BUILTINSXP', from the table in Rinternals.h). It does this by finding out that e is a LANGSXP (line 614), that is.na is a SYMSXP (line 670), and that the function it references (op) is a BUILTINSXP (line 700). It then uses (line 717)

(gdb) call PRIMFUN(op)
$8 = (SEXP (*)(SEXP, SEXP, SEXP, SEXP)) 0x7ffff77e0b20 <do_isna>

to discover that it should invoke do_isna with the values it's discovered.

Hopefully that removes some of the mystery, and points to relevant parts of the code.

like image 138
Martin Morgan Avatar answered Oct 21 '22 02:10

Martin Morgan