
R String Interpretation: why does "\040" get interpreted as " " and what other potential pitfalls could I come across in string interpretation?

I was helping someone today use regular expressions to extract information from a PDF file that we had read in as a text file. The tm package's readPDF function was not working correctly at the time, though after a few modifications we got it to work. While stripping some of the fluff from the .txt file, we found something that surprised most of us: the string "\040" is interpreted as a space, " ".

> x <- "\040"
> x
[1] " "

This doesn't appear to happen for other, similar strings (e.g. "\n" or "\t") where you might expect it.

> y <- "\n"
> y
[1] "\n"
> z <- "\t"
> z
[1] "\t"

Why is this? What other character strings are interpreted differently in R?

EDIT:

After some simple experimentation, it seems that any "\xxx", where each x is a digit, yields a different character. What is the purpose of this?

stanekam asked Feb 15 '23 09:02



1 Answer

Take a look here: http://stat.ethz.ch/R-manual/R-devel/library/base/html/Quotes.html

Backslash is used to start an escape sequence inside character constants. Escaping a character not in the following table is an error.

...

\nnn character with given octal code (1, 2 or 3 digits)
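The same escape table in that documentation also lists "\n" (newline) and "\t" (tab), so they are interpreted too. The difference seen in the question comes from how print() displays the result: it shows non-printable characters in escaped form, while a space is printable and is shown as-is. A minimal sketch using base R only (the variable name is just illustrative):

```r
y <- "\n"
nchar(y)              # 1: one newline character, not a backslash plus "n"
utf8ToInt(y)          # 10, the code for line feed
identical(y, "\012")  # TRUE: octal 012 is decimal 10

print("a\tb")         # print() re-escapes the tab: [1] "a\tb"
cat("a\tb\n")         # cat() writes the actual tab character
```

Use cat() or writeLines() when you want to see the characters a string really contains.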

Then look at an ASCII table to see which character each octal code represents. As you will see, octal 040 is a space.
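If you would rather check this in R than in a table, something like the following should work. It is only a sketch with base functions (strtoi, intToUtf8), and the variable name is illustrative:

```r
# Convert the octal digits to a decimal code point.
code <- strtoi("040", base = 8L)
code                        # 32

# Code point 32 is the space character.
intToUtf8(code)             # " "
identical("\040", " ")      # TRUE
identical("\040", "\x20")   # TRUE, same character written as a hex escape
```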

And just for fun:

> '\110\145\154\154\157\040\127\157\162\154\144\041'
[1] "Hello World!"
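Going the other way, one possible way to produce such a string of octal escapes from ordinary text is sketched below. It assumes plain ASCII input, and the variable names are illustrative:

```r
txt <- "Hello World!"

# utf8ToInt() gives the code points; "%03o" formats each as 3 octal digits.
encoded <- paste0(sprintf("\\%03o", utf8ToInt(txt)), collapse = "")

# cat() shows the single backslashes as they would be typed in a literal.
cat(encoded, "\n")
```

Here encoded holds literal backslashes followed by digits, so it is the text of the escape sequences, not the interpreted string.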
Thomas answered Feb 17 '23 12:02
