
How exactly does R parse `->`, the right-assignment operator?

Tags:

r

yacc

So this is kind of a trivial question, but it's bugging me that I can't answer it, and perhaps the answer will teach me some more details about how R works.

The title says it all: how does R parse ->, the obscure right-side assignment function?

My usual tricks to dive into this failed:

`->` 

Error: object -> not found

getAnywhere("->") 

no object named -> was found

And we can't call it directly:

`->`(3,x) 

Error: could not find function "->"

But of course, it works:

(3 -> x) #assigns the value 3 to the name x
# [1] 3

It appears R knows how to simply reverse the arguments, but I thought the above approaches would surely have cracked the case:

pryr::ast(3 -> y)
# \- ()
#   \- `<- #R interpreter clearly flipped things around
#   \- `y  #  (by the time it gets to `ast`, at least...)
#   \-  3  #  (note: this is because `substitute(3 -> y)`
#          #   already returns the reversed version)

Compare this to the regular assignment operator:

`<-`
.Primitive("<-")

`<-`(x, 3) #assigns the value 3 to the name x, as expected

?"->", ?assignOps, and the R Language Definition all simply mention it in passing as the right assignment operator.

But there's clearly something unique about how -> is used. It's not a function/operator (as the calls to getAnywhere and directly to `->` seem to demonstrate), so what is it? Is it completely in a class of its own?

Is there anything to learn from this besides "-> is completely unique within the R language in how it's interpreted and handled; memorize and move on"?

MichaelChirico asked Jan 04 '16 20:01


1 Answer

Let me preface this by saying I know absolutely nothing about how parsers work. Having said that, line 296 of gram.y defines the following tokens to represent assignment in the (YACC?) parser R uses:

%token      LEFT_ASSIGN EQ_ASSIGN RIGHT_ASSIGN LBB 

Then, on lines 5140 through 5150 of gram.c, this looks like the corresponding C code:

case '-':
  if (nextchar('>')) {
    if (nextchar('>')) {
      yylval = install_and_save2("<<-", "->>");
      return RIGHT_ASSIGN;
    }
    else {
      yylval = install_and_save2("<-", "->");
      return RIGHT_ASSIGN;
    }
  }

Finally, starting on line 5044 of gram.c, the definition of install_and_save2:

/* Get an R symbol, and set different yytext.  Used for translation of -> to <-. ->> to <<- */
static SEXP install_and_save2(char * text, char * savetext)
{
    strcpy(yytext, savetext);
    return install(text);
}

So again, having zero experience working with parsers, it seems that -> and ->> are translated directly into <- and <<-, respectively, at a very low level in the interpretation process.
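The translation described above can be checked from the R side without reading any C. The following is a minimal sketch in base R; the variable names (e, same_call, pd) are just for illustration. It inspects a parsed but unevaluated right assignment, then looks at the parse data, which should still list a RIGHT_ASSIGN token for the input even though the resulting call uses <-.

```r
# Parse (but do not evaluate) a right assignment.
e <- quote(3 -> x)

# The function slot of the call already holds `<-`, not `->`.
print(as.character(e[[1]]))

# Compare the parsed call with the ordinary left-assignment form.
same_call <- identical(e, quote(x <- 3))
print(same_call)

# The token stream keeps a record of what was typed: the parse data
# for this input contains a RIGHT_ASSIGN token.
pd <- utils::getParseData(parse(text = "3 -> x", keep.source = TRUE))
print(pd[pd$terminal, c("token", "text")])
```

This is consistent with the idea that the swap happens while the parser builds the call, not later during evaluation.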


You brought up a very good point in asking how the parser "knows" to reverse the arguments to ->, given that -> appears to be installed into the R symbol table as <-, so that it correctly interprets x -> y as y <- x and not x <- y. The best I can do is provide further speculation as I continue to come across "evidence" to support my claims. Hopefully some merciful YACC expert will stumble on this question and provide a little insight; I'm not going to hold my breath on that, though.

Back to lines 383 and 384 of gram.y, this looks like some more parsing logic related to the aforementioned LEFT_ASSIGN and RIGHT_ASSIGN symbols:

|   expr LEFT_ASSIGN expr       { $$ = xxbinary($2,$1,$3);  setId( $$, @$); }
|   expr RIGHT_ASSIGN expr      { $$ = xxbinary($2,$3,$1);  setId( $$, @$); }

Although I can't really make heads or tails of this crazy syntax, I did notice that the second and third arguments to xxbinary are swapped between LEFT_ASSIGN (xxbinary($2,$1,$3)) and RIGHT_ASSIGN (xxbinary($2,$3,$1)).

Here's what I'm picturing in my head:

LEFT_ASSIGN Scenario: y <- x

  • $2 is the second "argument" to the parser in the above expression, i.e. <-
  • $1 is the first; namely y
  • $3 is the third; x

Therefore, the resulting (C?) call would be xxbinary(<-, y, x).

Applying this logic to RIGHT_ASSIGN, i.e. x -> y, combined with my earlier conjecture about <- and -> getting swapped,

  • $2 gets translated from -> to <-
  • $1 is x
  • $3 is y

But since the result is xxbinary($2,$3,$1) instead of xxbinary($2,$1,$3), the result is still xxbinary(<-, y, x).
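The argument order worked out above can be observed directly in R. This is a small sketch using only base functions; parts_left and parts_right are illustrative names. It splits both parsed calls into their three elements (function symbol, first argument, second argument) so they can be compared side by side.

```r
# Parse both forms without evaluating them.
left  <- quote(y <- x)
right <- quote(x -> y)

# Break each call into its elements: function symbol, then the two arguments.
parts_left  <- vapply(as.list(left),  as.character, character(1))
parts_right <- vapply(as.list(right), as.character, character(1))

print(parts_left)
print(parts_right)
```

If the reading of the grammar rules is right, both vectors should list the assignment symbol first, then the target y, then the value x.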


Building off of this a little further, we have the definition of xxbinary on line 3310 of gram.c:

static SEXP xxbinary(SEXP n1, SEXP n2, SEXP n3)
{
    SEXP ans;
    if (GenerateCode)
        PROTECT(ans = lang3(n1, n2, n3));
    else
        PROTECT(ans = R_NilValue);
    UNPROTECT_PTR(n2);
    UNPROTECT_PTR(n3);
    return ans;
}

Unfortunately I could not find a proper definition of lang3 (or its variants lang1, lang2, etc...) in the parser source itself. As far as I can tell, it is part of R's C API (declared in Rinternals.h) and does not evaluate anything: it builds a three-element call object (a LANGSXP) from its arguments, here the function symbol followed by its two operands.
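To get a feel for what a three-element call looks like, here is a rough R-level analogue. It is not the C code path the parser uses, only a sketch with base R's as.call; the environment env and the other names are made up for the example. The point is that building the call and evaluating it are separate steps.

```r
# Build the call `y <- 3` by hand: function symbol, target, value.
assign_call <- as.call(list(as.name("<-"), as.name("y"), 3))
print(assign_call)
print(typeof(assign_call))   # calls have type "language"

# Nothing has been assigned yet; the call is only a data structure.
env <- new.env()
before <- exists("y", envir = env, inherits = FALSE)

# Evaluating the call is what performs the assignment.
eval(assign_call, env)
after <- get("y", envir = env)
print(after)
```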


Updates

I'll try to address some of your additional questions in the comments as best I can given my (very) limited knowledge of the parsing process.

1) Is this really the only object in R that behaves like this?? (I've got in mind the John Chambers quote via Hadley's book: "Everything that exists is an object. Everything that happens is a function call." This clearly lies outside that domain -- is there anything else like this?)

First, I agree that this lies outside of that domain. I believe Chambers' quote concerns the R Environment, i.e. processes that are all taking place after this low level parsing phase. I'll touch on this a little bit more below, however. Anyways, the only other example of this sort of behavior I could find is the ** operator, which is a synonym for the more common exponentiation operator ^. As with right assignment, ** doesn't seem to be "recognized" as a function call, etc... by the interpreter:

R> `->`
#Error: object '->' not found
R> `**`
#Error: object '**' not found

I found this because it's the only other case where install_and_save2 is used by the C parser:

case '*':
  /* Replace ** by ^.  This has been here since 1998, but is
     undocumented (at least in the obvious places).  It is in
     the index of the Blue Book with a reference to p. 431, the
     help for 'Deprecated'.  S-PLUS 6.2 still allowed this, so
     presumably it was for compatibility with S. */
  if (nextchar('*')) {
    yylval = install_and_save2("^", "**");
    return '^';
  } else
    yylval = install_and_save("*");
  return c;
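The same kind of check works for ** as for right assignment. A short sketch in base R (pow is an illustrative name): the parsed call should carry ^ as its function, and there should be no object named ** to look up.

```r
# Parse `2 ** 3` without evaluating it.
pow <- quote(2 ** 3)

# The function slot of the call holds "^", not "**".
print(as.character(pow[[1]]))

# Compare with the call produced by the usual operator.
print(identical(pow, quote(2 ^ 3)))

# Evaluation goes through `^` as usual.
print(2 ** 3)

# No object named `**` is defined anywhere on the search path.
print(exists("**"))
```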

2) When exactly does this happen? I've got in mind that substitute(3 -> y) has already flipped the expression; I couldn't figure out from the source what substitute does that would have pinged the YACC...

Of course I'm still speculating here, but yes, I think we can safely assume that when you call substitute(3 -> y), from the perspective of the substitute function, the expression always was y <- 3; i.e. the function is completely unaware that you typed 3 -> y. do_substitute, like 99% of the C functions used by R, only handles SEXP arguments: a LANGSXP (a call object) in the case of 3 -> y (== y <- 3), I believe. This is what I was alluding to above when I made a distinction between the R Environment and the parsing process.

I don't think there is anything that specifically triggers the parser to spring into action; rather, everything you input into the interpreter gets parsed. I did a little more reading about the YACC / Bison parser generator last night, and as I understand it (a.k.a. don't bet the farm on this), Bison uses the grammar you define (in the .y file(s)) to generate a parser in C, i.e. a C function which does the actual parsing of input. In turn, everything you input in an R session is first processed by this C parsing function, which then delegates the appropriate action to be taken in the R Environment (I'm using this term very loosely by the way). During this phase, lhs -> rhs will get translated to rhs <- lhs, ** to ^, etc...

For example, this is an excerpt from one of the tables of primitive functions in names.c:

/* Language Related Constructs */

/* Primitives */
{"if",      do_if,      0,  200,    -1, {PP_IF,      PREC_FN,     1}},
{"while",   do_while,   0,  100,    2,  {PP_WHILE,   PREC_FN,     0}},
{"for",     do_for,     0,  100,    3,  {PP_FOR,     PREC_FN,     0}},
{"repeat",  do_repeat,  0,  100,    1,  {PP_REPEAT,  PREC_FN,     0}},
{"break",   do_break, CTXT_BREAK,   0,  0,  {PP_BREAK,   PREC_FN,     0}},
{"next",    do_break, CTXT_NEXT,    0,  0,  {PP_NEXT,    PREC_FN,     0}},
{"return",  do_return,  0,  0,  -1, {PP_RETURN,  PREC_FN,     0}},
{"function",    do_function,    0,  0,  -1, {PP_FUNCTION,PREC_FN,     0}},
{"<-",      do_set,     1,  100,    -1, {PP_ASSIGN,  PREC_LEFT,   1}},
{"=",       do_set,     3,  100,    -1, {PP_ASSIGN,  PREC_EQ,     1}},
{"<<-",     do_set,     2,  100,    -1, {PP_ASSIGN2, PREC_LEFT,   1}},
{"{",       do_begin,   0,  200,    -1, {PP_CURLY,   PREC_FN,     0}},
{"(",       do_paren,   0,  1,  1,  {PP_PAREN,   PREC_FN,     0}},

You will notice that ->, ->>, and ** are not defined here. As far as I know, R primitive expressions such as <- and [, etc... are the closest interaction the R Environment ever has with any underlying C code. What I am suggesting is that by this stage in the process (from you typing a set of characters into the interpreter and hitting 'Enter', up through the actual evaluation of a valid R expression), the parser has already worked its magic, which is why you can't get a function definition for -> or ** by surrounding them with backticks, as you typically can.
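Both halves of this argument can be probed from an R session. The sketch below uses only base R, and the helper captured is a made-up name for the example. It first shows what substitute receives when the caller types a right assignment, then checks which assignment symbols actually exist as objects in a session where nobody has defined `->` themselves.

```r
# A function that returns its argument unevaluated, as the parser delivered it.
captured <- function(expr) substitute(expr)

# The caller types a right assignment; the function only ever sees a call.
seen <- captured(3 -> y)
print(seen)

# `<-` exists as an object and is a primitive.
has_left <- exists("<-")
print(is.primitive(`<-`))

# `->` does not exist as an object (assuming none was defined by the user).
has_right <- exists("->")
print(c(has_left, has_right))
```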

nrussell answered Oct 07 '22 14:10
