So this is kind of a trivial question, but it's bugging me that I can't answer it, and perhaps the answer will teach me some more details about how R works.
The title says it all: how does R parse ->, the obscure right-side assignment function?
My usual tricks to dive into this failed:
`->`
# Error: object '->' not found
getAnywhere("->")
# no object named '->' was found
And we can't call it directly:
`->`(3,x)
# Error: could not find function "->"
But of course, it works:
(3 -> x) # assigns the value 3 to the name x
# [1] 3
It appears R knows how to simply reverse the arguments, but I thought the above approaches would surely have cracked the case:
pryr::ast(3 -> y)
# \- ()
#   \- `<-    # R interpreter clearly flipped things around
#   \- `y     # (by the time it gets to `ast`, at least...)
#   \-  3     # (note: this is because `substitute(3 -> y)`
#             #  already returns the reversed version)
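The same flip can be seen without pryr. A minimal base-R sketch (it only inspects what quote() hands back, and says nothing about how the flip happens):

```r
# Capture the expression without evaluating it
e <- quote(3 -> y)

# The captured call already uses `<-`, with the operands swapped
print(e)

identical(e, quote(y <- 3))   # TRUE
as.character(e[[1]])          # "<-"
```

So whatever reverses the arguments has already run by the time any R-level function gets to look at the expression.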
Compare this to the regular assignment operator:
`<-`
# .Primitive("<-")
`<-`(x, 3) # assigns the value 3 to the name x, as expected
?"->", ?assignOps, and the R Language Definition all simply mention it in passing as the right assignment operator.
But there's clearly something unique about how -> is used. It's not a function/operator (as the calls to getAnywhere and directly to `->` seem to demonstrate), so what is it? Is it completely in a class of its own?
Is there anything to learn from this besides "-> is completely unique within the R language in how it's interpreted and handled; memorize and move on"?
As you may know, R comes from S, which used <- as its assignment operator. So why the arrow? Apparently it is a legacy of APL, a really old programming language that also used this sign for assignment: R (created in the early nineties) is actually a modern implementation of S (created in the mid-70s), and S was heavily influenced by APL.
Let me preface this by saying I know absolutely nothing about how parsers work. Having said that, line 296 of gram.y defines the following tokens to represent assignment in the (YACC?) parser R uses:
%token LEFT_ASSIGN EQ_ASSIGN RIGHT_ASSIGN LBB
Then, on lines 5140 through 5150 of gram.c, this looks like the corresponding C code:
    case '-':
        if (nextchar('>')) {
            if (nextchar('>')) {
                yylval = install_and_save2("<<-", "->>");
                return RIGHT_ASSIGN;
            }
            else {
                yylval = install_and_save2("<-", "->");
                return RIGHT_ASSIGN;
            }
        }
Finally, starting on line 5044 of gram.c, the definition of install_and_save2:
    /* Get an R symbol, and set different yytext.  Used for translation of -> to <-. ->> to <<- */
    static SEXP install_and_save2(char * text, char * savetext)
    {
        strcpy(yytext, savetext);
        return install(text);
    }
So again, having zero experience working with parsers, it seems that -> and ->> are translated directly into <- and <<-, respectively, at a very low level in the interpretation process.
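One way to poke at this from inside R is utils::getParseData(), which reports the tokens the parser recorded. This is only a sketch of how to inspect the result, assuming the parse data keeps the text as typed (which would match the savetext copied into yytext above); I'm not claiming it shows the parser's internals.

```r
# Parse with source references kept so that parse data is recorded
p <- parse(text = "3 -> y", keep.source = TRUE)

# Token-level view: one row per token, with its type and its text
pd <- utils::getParseData(p)
print(pd[pd$terminal, c("token", "text")])

# Expression-level view: the call itself is already a left assignment
deparsed <- deparse(p[[1]])
print(deparsed)
```

The token table and the deparsed call can then be compared side by side: the former describes what was typed, the latter what R will actually evaluate.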
You brought up a very good point in asking how the parser "knows" to reverse the arguments to -> (considering that -> appears to be installed into the R symbol table as <-) and is thus able to correctly interpret x -> y as y <- x and not x <- y. The best I can do is provide further speculation as I continue to come across "evidence" to support my claims. Hopefully some merciful YACC expert will stumble on this question and provide a little insight; I'm not going to hold my breath on that, though.
Back to lines 383 and 384 of gram.y, this looks like some more parsing logic related to the aforementioned LEFT_ASSIGN and RIGHT_ASSIGN symbols:
    | expr LEFT_ASSIGN expr   { $$ = xxbinary($2,$1,$3); setId( $$, @$); }
    | expr RIGHT_ASSIGN expr  { $$ = xxbinary($2,$3,$1); setId( $$, @$); }
Although I can't really make heads or tails of this crazy syntax, I did notice that the second and third arguments to xxbinary are swapped between LEFT_ASSIGN (xxbinary($2,$1,$3)) and RIGHT_ASSIGN (xxbinary($2,$3,$1)).
Here's what I'm picturing in my head:
LEFT_ASSIGN scenario: y <- x
- $2 is the second "argument" to the parser in the above expression, i.e. <-
- $1 is the first; namely y
- $3 is the third; x
Therefore, the resulting (C?) call would be xxbinary(<-, y, x).
Applying this logic to RIGHT_ASSIGN, i.e. x -> y, combined with my earlier conjecture about <- and -> getting swapped,
- $2 gets translated from -> to <-
- $1 is x
- $3 is y
But since the result is xxbinary($2,$3,$1) instead of xxbinary($2,$1,$3), the result is still xxbinary(<-, y, x).
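To make the argument order concrete, here is a toy R imitation of the two grammar actions. The names xxbinary_sketch, left_assign and right_assign are made up for illustration, and as.call() merely stands in for whatever the C code does; treat it as a sketch of the idea, not the parser's actual code.

```r
# Hypothetical stand-in for xxbinary: build the call op(a, b) without evaluating it
xxbinary_sketch <- function(op, a, b) as.call(list(op, a, b))

# Mimic `expr LEFT_ASSIGN expr`: operands passed in source order ($1, $3)
left_assign <- function(s1, s2, s3) xxbinary_sketch(s2, s1, s3)

# Mimic `expr RIGHT_ASSIGN expr`: operands swapped ($3, $1)
right_assign <- function(s1, s2, s3) xxbinary_sketch(s2, s3, s1)

# Assumption: the lexer has already turned -> into the symbol <-
arrow <- as.name("<-")

from_left  <- left_assign(as.name("y"), arrow, as.name("x"))   # models y <- x
from_right <- right_assign(as.name("x"), arrow, as.name("y"))  # models x -> y

identical(from_left, from_right)       # TRUE
identical(from_right, quote(y <- x))   # TRUE
```

Both routes end in the same call object, which is consistent with the idea that the swap in the grammar rule is all that distinguishes the two spellings.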
Building off of this a little further, we have the definition of xxbinary on line 3310 of gram.c:
    static SEXP xxbinary(SEXP n1, SEXP n2, SEXP n3)
    {
        SEXP ans;
        if (GenerateCode)
            PROTECT(ans = lang3(n1, n2, n3));
        else
            PROTECT(ans = R_NilValue);
        UNPROTECT_PTR(n2);
        UNPROTECT_PTR(n3);
        return ans;
    }
Unfortunately I could not find a proper definition of lang3 (or its variants lang1, lang2, etc...) in the files I was reading. As far as I can tell they are small inline helpers (in Rinlines.h) that build a call object of the given length rather than evaluate anything, so lang3(n1, n2, n3) would construct the unevaluated call n1(n2, n3), which the interpreter evaluates later.
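At the R level, building a three-element call and evaluating it are separate steps, which may help in picturing what a constructor of this kind produces. A small base-R sketch (the environment env is just scratch space for the example, and call() is only an R-level analogue, not the C function itself):

```r
# Build the call `y <- 3` by hand: a function name plus two arguments
cl <- call("<-", as.name("y"), 3)
identical(cl, quote(y <- 3))                  # TRUE

env <- new.env()
exists("y", envir = env, inherits = FALSE)    # FALSE: building the call assigned nothing

# Only evaluation performs the assignment
eval(cl, env)
get("y", envir = env)                         # 3
```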
Updates: I'll try to address some of your additional questions in the comments as best I can given my (very) limited knowledge of the parsing process.
1) Is this really the only object in R that behaves like this?? (I've got in mind the John Chambers quote via Hadley's book: "Everything that exists is an object. Everything that happens is a function call." This clearly lies outside that domain -- is there anything else like this?)
First, I agree that this lies outside of that domain. I believe Chambers' quote concerns the R Environment, i.e. processes that are all taking place after this low level parsing phase. I'll touch on this a little bit more below, however. Anyways, the only other example of this sort of behavior I could find is the ** operator, which is a synonym for the more common exponentiation operator ^. As with right assignment, ** doesn't seem to be "recognized" as a function call, etc... by the interpreter:
    R> `->`
    # Error: object '->' not found
    R> `**`
    # Error: object '**' not found
I found this because it's the only other case where install_and_save2 is used by the C parser:
    case '*':
        /* Replace ** by ^.  This has been here since 1998, but is
           undocumented (at least in the obvious places).  It is in
           the index of the Blue Book with a reference to p. 431, the
           help for 'Deprecated'.  S-PLUS 6.2 still allowed this, so
           presumably it was for compatibility with S. */
        if (nextchar('*')) {
            yylval = install_and_save2("^", "**");
            return '^';
        } else
            yylval = install_and_save("*");
        return c;
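The effect of that rewrite is easy to check from the R prompt. A quick base-R sketch (the variable names are just for the example):

```r
# The parsed call is the same whichever spelling is typed
same_call <- identical(quote(2 ** 3), quote(2 ^ 3))
same_call                        # TRUE

# Deparsing shows only the canonical operator
deparse(quote(2 ** 3))           # "2^3"

# It evaluates like ^ ...
power <- 2 ** 3
power                            # 8

# ... yet only ^ exists as a function
has_caret  <- exists("^")        # TRUE
has_double <- exists("**")       # FALSE
```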
2) When exactly does this happen? I've got in mind that substitute(3 -> y) has already flipped the expression; I couldn't figure out from the source what substitute does that would have pinged the YACC...
Of course I'm still speculating here, but yes, I think we can safely assume that when you call substitute(3 -> y), from the perspective of the substitute function, the expression always was y <- 3; i.e. the function is completely unaware that you typed 3 -> y. do_substitute, like 99% of the C functions used by R, only handles SEXP arguments: a LANGSXP (a call object) in the case of 3 -> y (== y <- 3), I believe. This is what I was alluding to above when I made a distinction between the R Environment and the parsing process. I don't think there is anything that specifically triggers the parser to spring into action; rather, everything you input into the interpreter gets parsed.

I did a little more reading about the YACC / Bison parser generator last night, and as I understand it (a.k.a. don't bet the farm on this), Bison uses the grammar you define (in the .y file(s)) to generate a parser in C, i.e. a C function which does the actual parsing of input. In turn, everything you input in an R session is first processed by this C parsing function, which then delegates the appropriate action to be taken in the R Environment (I'm using this term very loosely by the way). During this phase, lhs -> rhs will get translated to rhs <- lhs, ** to ^, etc... For example, this is an excerpt from one of the tables of primitive functions in names.c:
    /* Language Related Constructs */

    /* Primitives */

    {"if",       do_if,       0,          200, -1, {PP_IF,       PREC_FN,   1}},
    {"while",    do_while,    0,          100,  2, {PP_WHILE,    PREC_FN,   0}},
    {"for",      do_for,      0,          100,  3, {PP_FOR,      PREC_FN,   0}},
    {"repeat",   do_repeat,   0,          100,  1, {PP_REPEAT,   PREC_FN,   0}},
    {"break",    do_break,    CTXT_BREAK, 0,    0, {PP_BREAK,    PREC_FN,   0}},
    {"next",     do_break,    CTXT_NEXT,  0,    0, {PP_NEXT,     PREC_FN,   0}},
    {"return",   do_return,   0,          0,   -1, {PP_RETURN,   PREC_FN,   0}},
    {"function", do_function, 0,          0,   -1, {PP_FUNCTION, PREC_FN,   0}},
    {"<-",       do_set,      1,          100, -1, {PP_ASSIGN,   PREC_LEFT, 1}},
    {"=",        do_set,      3,          100, -1, {PP_ASSIGN,   PREC_EQ,   1}},
    {"<<-",      do_set,      2,          100, -1, {PP_ASSIGN2,  PREC_LEFT, 1}},
    {"{",        do_begin,    0,          200, -1, {PP_CURLY,    PREC_FN,   0}},
    {"(",        do_paren,    0,          1,    1, {PP_PAREN,    PREC_FN,   0}},
You will notice that ->, ->>, and ** are not defined here. As far as I know, R primitive expressions such as <- and [, etc... are the closest interaction the R Environment ever has with any underlying C code. What I am suggesting is that by this stage in the process (from you typing a set of characters into the interpreter and hitting 'Enter', up through the actual evaluation of a valid R expression), the parser has already worked its magic, which is why you can't get a function definition for -> or ** by surrounding them with backticks, as you typically can.
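Both halves of that claim can be checked from R itself. A short sketch in base R (capture is a throwaway helper written for this example, not anything from the R sources):

```r
# A function that hands back its argument as the parser delivered it
capture <- function(expr) substitute(expr)

captured <- capture(3 -> y)
identical(captured, quote(y <- 3))   # TRUE: substitute() never sees `->`

# The operators listed in the names.c excerpt exist as primitives...
prims <- sapply(c("<-", "<<-", "="), function(op) is.primitive(get(op)))
print(prims)

# ...while the spellings the parser rewrites have no function behind them
rewritten <- sapply(c("->", "->>", "**"), exists)
print(rewritten)
```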