From ?Quotes:
\xnn    character with given hex code (1 or 2 hex digits)
\unnnn  Unicode character with given code (1--4 hex digits)
In the case where the Unicode character has only one or two digits, I would expect these characters to be the same. In fact, one of the examples on the ?Quotes help page shows:
"\x48\x65\x6c\x6c\x6f\x20\x57\x6f\x72\x6c\x64\x21"
## [1] "Hello World!"
"\u48\u65\u6c\u6c\u6f\u20\u57\u6f\u72\u6c\u64\u21"
## [1] "Hello World!"
However, under Linux, when trying to print a pound sign, I see
cat("\ua3")
## £
cat("\xa3")
## �
That is, the \x hex code fails to display correctly. (This behaviour persisted with any locale that I tried.) Under Windows 7 both versions show a pound sign.
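For reference, the native encoding of each session can be inspected with the base functions l10n_info() and Sys.getlocale() (shown here without output, since it varies by system):

l10n_info()                  # reports whether the native encoding is UTF-8 or Latin-1
Sys.getlocale("LC_CTYPE")    # the locale that governs character handling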
If I convert to integer and back then the pound sign displays correctly under Linux.
cat(intToUtf8(utf8ToInt("\xa3")))
## £
Incidentally, this doesn't work under Windows, since utf8ToInt("\xa3") returns NA.
Some \x characters return NA under Windows but throw an error under Linux. For example:
utf8ToInt("\xf0")
## Error in utf8ToInt("\xf0") : invalid UTF-8 string
("\uf0"
is a valid character.)
These examples show that there are some differences between \x and \u forms of characters, which seem to be OS-specific, but I can't see any logic in how they are defined.
What are the differences between these two character forms?
Unicode characters are distinguished by code points, which are conventionally represented by "U+" followed by four, five or six hexadecimal digits, for example U+00AE or U+1D310.
Unicode is an international character encoding standard that provides a unique number for every character across languages and scripts, making almost all characters accessible across platforms, programs, and devices.
The "U+" notation is useful. It gives a way of marking hexadecimal digits as being Unicode code points, instead of octets, or unrestricted 16-bit quantities, or characters in other encodings. It works well in running text. The "U" suggests "Unicode".
The Unicode Standard specifies a unique numeric value and name for each character and defines three encoding forms for the bit representation of the numeric value. The name/value pair creates an identity for a character. The hexadecimal value representing a character is called a code point.
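As a small illustration (assuming a UTF-8-capable console), a code point can be entered in R either with a \u escape or via intToUtf8():

> intToUtf8(0xAE)    # U+00AE, REGISTERED SIGN
[1] "®"
> "\u00ae"           # the same character via a \u escape
[1] "®"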
The escape sequence \xNN inserts the raw byte NN into a string, whereas \uNN inserts the UTF-8 bytes for the Unicode code point NN into a UTF-8 string:
> charToRaw('\xA3')
[1] a3
> charToRaw('\uA3')
[1] c2 a3
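For code points above U+00FF the \u form necessarily produces more than one byte, which makes the distinction even clearer (for example, the euro sign U+20AC):

> charToRaw('\u20ac')
[1] e2 82 ac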
These two types of escape sequence cannot be mixed in the same string:
> '\ua3\xa3'
Error: mixing Unicode and octal/hex escapes in a string is not allowed
This is because the escape sequences also define the encoding of the string. A \uNN sequence explicitly sets the encoding of the entire string to "UTF-8", whereas \xNN leaves it in the default "unknown" (i.e. native) encoding:
> Encoding('\xa3')
[1] "unknown"
> Encoding('\ua3')
[1] "UTF-8"
This becomes important when printing strings, as they need to be converted into the appropriate output encoding (e.g., that of your console). Strings with a defined encoding can be converted appropriately (see enc2native), but those with an "unknown" encoding are simply output as-is:
On Linux, the console is probably expecting UTF-8; since 0xA3 is not a valid UTF-8 sequence, it gives you "�". On Windows, the console is probably expecting Windows-1252; since 0xA3 is the correct encoding for "£", that's what you see. (When the string is \uA3, a conversion from UTF-8 to Windows-1252 takes place.)

If the encoding is set explicitly, the appropriate conversion will take place on Linux:
> s <- '\xa3'
> Encoding(s) <- 'latin1'
> cat(s)
£
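An alternative, assuming the byte really is Latin-1, is to convert it explicitly with iconv(), which returns a string already marked as "UTF-8":

> iconv('\xa3', from = 'latin1', to = 'UTF-8')
[1] "£"
> Encoding(iconv('\xa3', from = 'latin1', to = 'UTF-8'))
[1] "UTF-8"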