I was helping someone today regex some info out of a PDF file that we read in as a .txt file. Unfortunately the tm package's readPDF function was not working correctly at the time, though with a few modifications we were able to get it to work just fine. While we were regexing some of the fluff out of the .txt file, we found something that surprised most of us: the string "\040" gets interpreted as a space, " ".
> x <- "\040"
> x
[1] " "
This doesn't seem to happen for other, similar character strings (e.g. "\n" or "\t") where you might expect it to.
> y <- "\n"
> y
[1] "\n"
> z <- "\t"
> z
[1] "\t"
Why is this? What other character strings are interpreted differently in R?
EDIT:
After some simple experimentation, it seems that any "\xxx" where each x is a digit yields a different character. What is the purpose of this?
Take a look here: http://stat.ethz.ch/R-manual/R-devel/library/base/html/Quotes.html
Backslash is used to start an escape sequence inside character constants. Escaping a character not in the following table is an error.
...
\nnn character with given octal code (1, 2 or 3 digits)
Then take a look at an ASCII table to see which character each octal code represents. As you will see, 040 is a space.
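If you would rather check this from within R than from a table, here is a minimal sketch using base R only. The variable names are just for illustration. It also suggests that "\n" is handled the same way: it is a single character too, and print() merely displays it in escaped form.

```r
# "\040" is one character whose code is octal 040, i.e. decimal 32.
space_code <- strtoi("040", base = 8L)   # parse the digits "040" as base 8
char_code  <- utf8ToInt("\040")          # code point of the character itself
same_space <- identical("\040", " ")

# "\n" is likewise a single character (octal 012). print() shows it in
# escaped form, whereas cat() writes the character itself.
newline_len  <- nchar("\n")
same_newline <- identical("\012", "\n")

print(c(space_code, char_code))
print(c(same_space, same_newline, newline_len == 1L))
```

So the difference seen in the question is most likely a display matter: a space has a printable representation, while a newline or tab is shown by print() as its escape sequence.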
And just for fun:
> '\110\145\154\154\157\040\127\157\162\154\144\041'
[1] "Hello World!"
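Going the other way, a rough sketch of how you could recover the octal escapes from an ordinary string, assuming plain ASCII input (msg, codes, octal and escaped are illustrative names, not anything from the R documentation):

```r
msg   <- "\110\145\154\154\157\040\127\157\162\154\144\041"
codes <- utf8ToInt(msg)            # integer code point of each character
octal <- sprintf("%03o", codes)    # each code as a three-digit octal string

# Prefix every octal code with a literal backslash and glue them together,
# giving the text you would type inside the quotes.
escaped <- paste0("\\", octal, collapse = "")
cat(escaped, "\n")
```

The sixth element of octal is "040", the space between the two words.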