From ?Quotes:
\xnn    character with given hex code (1 or 2 hex digits)
\unnnn  Unicode character with given code (1--4 hex digits)
In the case where the Unicode character has only one or two digits, I would expect these characters to be the same. In fact, one of the examples on the ?Quotes help page shows:
"\x48\x65\x6c\x6c\x6f\x20\x57\x6f\x72\x6c\x64\x21"
## [1] "Hello World!"
"\u48\u65\u6c\u6c\u6f\u20\u57\u6f\u72\u6c\u64\u21"
## [1] "Hello World!"
However, under Linux, when trying to print a pound sign, I see
cat("\ua3")
## £
cat("\xa3")
## �
That is, the \x hex code fails to display correctly. (This behaviour persisted with any locale that I tried.) Under Windows 7 both versions show a pound sign.
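For reference, the native encoding of each session can be inspected with the base functions l10n_info() and Sys.getlocale() (shown here without output, since it varies by system):

l10n_info()                  # reports whether the native encoding is UTF-8 or Latin-1
Sys.getlocale("LC_CTYPE")    # the locale that governs character handling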
If I convert to integer and back then the pound sign displays correctly under Linux.
cat(intToUtf8(utf8ToInt("\xa3")))
## £
Incidentally, this doesn't work under Windows, since utf8ToInt("\xa3") returns NA.
Some \x characters return NA under Windows but throw an error under Linux. For example:
utf8ToInt("\xf0")
## Error in utf8ToInt("\xf0") : invalid UTF-8 string
("\uf0"
is a valid character.)
These examples show that there are some differences between \x and \u forms of characters, which seem to be OS-specific, but I can't see any logic in how they are defined.
What are the differences between these two character forms?
Unicode characters are distinguished by code points, which are conventionally represented by "U+" followed by four, five or six hexadecimal digits, for example U+00AE or U+1D310.
Unicode is an international character encoding standard that provides a unique number for every character across languages and scripts, making almost all characters accessible across platforms, programs, and devices.
The "U+" notation is useful. It gives a way of marking hexadecimal digits as being Unicode code points, instead of octets, or unrestricted 16-bit quantities, or characters in other encodings. It works well in running text. The "U" suggests "Unicode".
The Unicode Standard specifies a unique numeric value and name for each character and defines three encoding forms for the bit representation of the numeric value. The name/value pair creates an identity for a character. The hexadecimal value representing a character is called a code point.
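As a small illustration (assuming a UTF-8-capable console), a code point can be entered in R either with a \u escape or via intToUtf8():

> intToUtf8(0xAE)    # U+00AE, REGISTERED SIGN
[1] "®"
> "\u00ae"           # the same character via a \u escape
[1] "®"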
The escape sequence \xNN inserts the raw byte NN into a string, whereas \uNN inserts the UTF-8 bytes for the Unicode code point NN into a UTF-8 string:
> charToRaw('\xA3')
[1] a3
> charToRaw('\uA3')
[1] c2 a3
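For code points above U+00FF the \u form necessarily produces more than one byte, which makes the distinction even clearer (for example, the euro sign U+20AC):

> charToRaw('\u20ac')
[1] e2 82 ac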
These two types of escape sequence cannot be mixed in the same string:
> '\ua3\xa3'
Error: mixing Unicode and octal/hex escapes in a string is not allowed
This is because the escape sequences also define the encoding of the string. A \uNN sequence explicitly sets the encoding of the entire string to "UTF-8", whereas \xNN leaves it in the default "unknown" (i.e. native) encoding:
> Encoding('\xa3')
[1] "unknown"
> Encoding('\ua3')
[1] "UTF-8"
This becomes important when printing strings, as they need to be converted into the appropriate output encoding (e.g., that of your console). Strings with a defined encoding can be converted appropriately (see enc2native), but those with an "unknown" encoding are simply output as-is:
On Linux, the console is probably expecting UTF-8; since 0xA3 is not a valid UTF-8 sequence, it gives you "�". On Windows, the console is probably expecting Windows-1252; since 0xA3 is the correct encoding for "£", that's what you see. (When the string is \uA3, a conversion from UTF-8 to Windows-1252 takes place.)

If the encoding is set explicitly, the appropriate conversion will take place on Linux:
> s <- '\xa3'
> Encoding(s) <- 'latin1'
> cat(s)
£
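An alternative, assuming the byte really is Latin-1, is to convert it explicitly with iconv(), which returns a string already marked as "UTF-8":

> iconv('\xa3', from = 'latin1', to = 'UTF-8')
[1] "£"
> Encoding(iconv('\xa3', from = 'latin1', to = 'UTF-8'))
[1] "UTF-8"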