Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How does R handle Unicode / UTF-8?

Tags:

r

unicode

utf-8

If I write

`Δ` <- function(a,b)   (a-b)/a 

then I can include U+394 so long as it's enclosed in backticks. (By contrast, Δ <- function(a,b) (a-b)/a fails with unexpected input in "�".) So apparently R parses UTF-8 or Unicode or something like that. The assignment goes well and so does the evaluation of eg

`Δ`(1:5, 9:13)

. And I can also evaluate Δ(1:5, 9:13).

Finally, if I defined something like winsorise <- function(x, λ=.05) { ... } then λ (U+3bb) doesn't need to be "introduced to" R with a backtick. I can then call winsorise(data, .1) with no problems.


The only mention in R's documentation I can find of unicode is over my head. Could someone who understands it better explain to me — what's going on "under the hood" when R needs the ` to understand assignment to ♔, but can parse ♔(a,b,c) once assigned?

like image 342
isomorphismes Avatar asked Feb 12 '15 17:02

isomorphismes


People also ask

Is UTF-8 the same as Unicode?

The Difference Between Unicode and UTF-8Unicode is a character set. UTF-8 is encoding. Unicode is a list of characters with unique decimal numbers (code points).

What is R default encoding?

Encoding in R 1.8. The default behavour is to treat characters as a stream of 8-bit bytes, and not to interpret them other than to assume that each byte represents one character. The only exceptions are. The connections mechanism allows the remapping of input files to the `native' encoding.

Is UTF-8 ASCII or Unicode?

UTF-8 encodes Unicode characters into a sequence of 8-bit bytes. The standard has a capacity for over a million distinct codepoints and is a superset of all characters in widespread use today. By comparison, ASCII (American Standard Code for Information Interchange) includes 128 character codes.

Can UTF-8 handle all languages?

UTF-8 supports any unicode character, which pragmatically means any natural language (Coptic, Sinhala, Phonecian, Cherokee etc), as well as many non-spoken languages (Music notation, mathematical symbols, APL).


2 Answers

I can't speak to what's going on under the hood regarding the function calls vs. function arguments, but this email from Prof. Ripley from 2008 may shed some light (excerpt below):

R passes around, prints and plots UTF-8 character data pretty well, but it translates to the native encoding for almost all character-level manipulations (and not just on Windows). ?Encoding spells out the exceptions [...]

The reason R does this translation (on Windows at least) is mentioned in the documentation that the OP linked to:

Windows has no UTF-8 locales, but rather expects to work with UCS-2 strings. R (being written in standard C) would not work internally with UCS-2 without extensive changes.

The R documentation for ?Quotes explains how you can sometimes use out-of-locale characters anyway (emphasis added):

Identifiers consist of a sequence of letters, digits, the period (.) and the underscore. They must not start with a digit nor underscore, nor with a period followed by a digit. Reserved words are not valid identifiers.

The definition of a letter depends on the current locale, but only ASCII digits are considered to be digits.

Such identifiers are also known as syntactic names and may be used directly in R code. Almost always, other names can be used provided they are quoted. The preferred quote is the backtick (`), and deparse will normally use it, but under many circumstances single or double quotes can be used (as a character constant will often be converted to a name). One place where backticks may be essential is to delimit variable names in formulae: see formula.

There is another way to get at such characters, which is using the unicode escape sequence (like \u0394 for Δ). This is usually a bad idea if you're using that character for anything other than text on a plot (i.e., don't do this for variable or function names; cf. this quote from the R 2.7 release notes, when much of the current UTF-8 support was added):

If a string presented to the parser contains a \uxxxx escape invalid in the current locale, the string is recorded in UTF-8 with the encoding declared. This is likely to throw an error if it is used later in the session, but it can be printed, and used for e.g. plotting on the windows() device. So "\u03b2" gives a Greek small beta and "\u2642" a 'male sign'. Such strings will be printed as e.g. <U+2642> except in the Rgui console (see below).

I think this addresses most of your questions, though I don't know why there is a difference between the function name and function argument examples you gave; hopefully someone more knowledgable can chime in on that. FYI, on Linux all of these different ways of assigning and calling a function work without error (because the system locale is UTF-8, so no translation need occur):

Δ <- function(a,b) (a-b)/a         # no error
`Δ` <- function(a,b) (a-b)/a       # no error
"Δ" <- function(a,b) (a-b)/a       # no error
"\u0394" <- function(a,b) (a-b)/a  # no error
Δ(1:5, 9:13)        # -8.00 -4.00 -2.67 -2.00 -1.60
`Δ`(1:5, 9:13)      # same
"Δ"(1:5, 9:13)      # same
"\u0394"(1:5, 9:13) # same

sessionInfo()

# R version 3.1.2 (2014-10-31)
# Platform: x86_64-pc-linux-gnu (64-bit)

# locale:
# LC_CTYPE=en_US.UTF-8    LC_NUMERIC=C                LC_TIME=en_US.UTF-8
# LC_COLLATE=en_US.UTF-8  LC_MONETARY=en_US.UTF-8     LC_MESSAGES=en_US.UTF-8
# LC_PAPER=en_US.UTF-8    LC_NAME=C                   LC_ADDRESS=C
# LC_TELEPHONE=C          LC_MEASUREMENT=en_US.UTF-8  LC_IDENTIFICATION=C

# attached base packages:
# stats  graphics  grDevices  utils  datasets  methods  base  
like image 90
drammock Avatar answered Oct 19 '22 16:10

drammock


For the record, under R-devel (2015-02-11 r67792), Win 7, English UK locale, I see:

options(encoding = "UTF-8")

`Δ` <- function(a,b) (a-b)/a 
## Error: \uxxxx sequences not supported inside backticks (line 1)

Δ <- function(a,b) (a-b)/a
## Error: unexpected input in "\"

"Δ" <- function(a,b) (a-b)/a      # OK

`Δ`(1:5, 9:13)
## Error: \uxxxx sequences not supported inside backticks (line 1)

Δ(1:5, 9:13)
## Error: unexpected input in "\"

"Δ"(1:5, 9:13)
## Error: could not find function "Δ"
like image 34
Richie Cotton Avatar answered Oct 19 '22 15:10

Richie Cotton