
Automatically escape unicode characters

Tags:

r

How can you display a unicode string, say:

x <- "•"

using its escaped equivalent?

y <- "\u2022"

identical(x, y)
# [1] TRUE

(I'd like to be able to do this because CRAN packages must contain only ASCII, but sometimes you want to use unicode in an error message or similar)

377

asked Aug 14 '14 13:08

hadley

4 Answers

R automatically escapes unicode in C locale:

x <- "•"
Sys.setlocale(locale = 'C')
print(x)
# [1] "<U+2022>"
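The same `<U+2022>` notation can also be produced for a single string without changing the session locale. The sketch below assumes a reasonably recent version of R in which `iconv()` accepts `sub = "Unicode"` and `sub = "c99"` (check `?iconv` on your installation); the variable names are only illustrative.

```r
x <- "\u2022"  # bullet, written as an escape so this snippet stays ASCII

# Replace characters that cannot be represented in ASCII with the
# <U+xxxx> notation that print() uses in the C locale
angle <- iconv(x, from = "UTF-8", to = "ASCII", sub = "Unicode")

# Same conversion, but substituting C99-style \uxxxx escapes instead
c99 <- iconv(x, from = "UTF-8", to = "ASCII", sub = "c99")

print(angle)
print(c99)
```

The second form is already valid inside an R string literal, which is what an ASCII-only source file needs.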

124

answered Oct 01 '22 03:10

Jeroen Ooms

I wrote a small package called uniscape that can convert non-ASCII characters to the corresponding "\u1234" or "\U12345678" Unicode escape codes (obviously with a literal backslash). It can do so for any character or only for characters inside an R string (single or double quoted). The following example shows how u_escape converts a character. The output is then surrounded with quotes, parsed, and evaluated. The final result matches the original character.

x <- rawToChar(as.raw(c(0xe2, 0x80, 0xa2)))
Encoding(x) <- "UTF-8"
x
# [1] "•"
x_u <- uniscape::u_escape(x)
x_u
# [1] "\\u2022"
y <- eval(parse(text = paste0('"', x_u, '"')))
y
# [1] "•"
identical(x, y)
# [1] TRUE

The package (on GitHub) also provides RStudio addins for convenience. The addins operate on the active source editor document. The package has no hard dependencies except rstudioapi.

The first screenshot shows an example document with a selected text area and the RStudio addin window listing three uniscape addins, with the "Escape selection" addin selected.

[Image: Example document and addin window]

The second screenshot shows the result after applying "Escape selection", with the escape sequence of each non-ASCII character automatically highlighted (selected).

[Image: Result of Escape selection addin]

After undoing the previous operation, the third screenshot shows the result of "Escape strings in file". Each affected R string in the active file is automatically highlighted by the addin. Commented strings are ignored. "Escape selected strings" does the same but only for the selected text area.

[Image: Result of Escape strings in file]

answered Nov 05 '22 11:11

mvkorpel

After digging into some documentation about iconv, I think you can accomplish this using only the base package. But you need to pay extra attention to the encoding of the string.

On a system with UTF-8 encoding:

> library(stringi)
> x <- "你好世界"
> stri_escape_unicode(x)
[1] "\\u4f60\\u597d\\u4e16\\u754c"

# use big endian
> iconv(x, "UTF-8", "UTF-16BE", toRaw=TRUE)
[[1]]
[1] 4f 60 59 7d 4e 16 75 4c

> x <- "•"
> iconv(x, "UTF-8", "UTF-16BE", toRaw=TRUE)
[[1]]
[1] 20 22
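For characters in the Basic Multilingual Plane, each pair of UTF-16BE bytes is exactly one code point, so the raw bytes can be turned into escapes directly. This is a minimal sketch, assuming a single UTF-8 input string and an `iconv()` build that supports `UTF-16BE`. `utf16_pairs_to_escape` is a hypothetical helper name, not part of base R.

```r
utf16_pairs_to_escape <- function(x) {
  bytes <- as.integer(iconv(x, "UTF-8", "UTF-16BE", toRaw = TRUE)[[1]])
  # Two rows: the high byte and the low byte of each 16-bit unit
  units <- matrix(bytes, nrow = 2)
  codes <- units[1, ] * 256L + units[2, ]
  # Only correct for BMP characters: surrogate pairs are not combined here
  paste0(sprintf("\\u%04x", codes), collapse = "")
}

utf16_pairs_to_escape("\u4f60\u597d")
```

Note that this sketch escapes every character, including plain ASCII ones, which is valid but more verbose than necessary.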

But, if you are on a system with latin1 encoding, things may go wrong.

> x <- "•"
> y <- "\u2022"
> identical(x, y)
[1] FALSE
> stri_escape_unicode(x)
[1] "\\u0095" # <- oops!

# culprit
> Encoding(x)
[1] "latin1"

# and it causes problem for iconv
> iconv(x, Encoding(x), "Unicode")
Error in iconv(x, Encoding(x), "Unicode") : 
  unsupported conversion from 'latin1' to 'Unicode' in codepage 1252
> iconv(x, Encoding(x), "UTF-16BE")
Error in iconv(x, Encoding(x), "UTF-16BE") : 
  embedded nul in string: '\0•'

It is safer to convert the string to UTF-8 first and only then convert it to UTF-16:

> iconv(enc2utf8(enc2native(x)), "UTF-8", "UTF-16BE", toRaw=T)
[[1]]
[1] 20 22

EDIT: This may cause some problems for strings already in UTF-8 encoding on some particular systems. Maybe it's safer to check the encoding before conversion.

> Encoding("•")
[1] "latin1"
> enc2native("•")
[1] "•"
> enc2native("\u2022")
[1] "•"
# on Windows with the default latin1 encoding
> Encoding("测试") 
[1] "UTF-8"
> enc2native("测试") 
[1] "<U+6D4B><U+8BD5>"   # <- BAD!
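One way to act on that advice is to leave strings that are already marked as UTF-8 untouched and re-encode only the others, so that UTF-8 input never passes through the native encoding. This is a minimal sketch with `to_utf8` as a hypothetical helper name; it assumes that unmarked non-ASCII strings are in the native encoding.

```r
to_utf8 <- function(x) {
  stopifnot(is.character(x))
  marked <- Encoding(x) == "UTF-8"
  # enc2utf8() takes the declared encoding (latin1 or native) into account;
  # elements already marked as UTF-8 are skipped entirely
  x[!marked] <- enc2utf8(x[!marked])
  x
}

# A latin1-marked e-acute, built from its raw byte
z <- rawToChar(as.raw(0xe9))
Encoding(z) <- "latin1"
Encoding(to_utf8(z))
```

The problematic step in the transcript above is `enc2native()`, which is why this sketch avoids it.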

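For some characters or languages, UTF-16 may not be enough: code points above U+FFFF are encoded as surrogate pairs, so a 16-bit unit no longer equals the code point. It is probably better to use UTF-32, since

"The UTF-32 form of a character is a direct representation of its codepoint."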
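To see the difference, compare the two encodings for a character outside the Basic Multilingual Plane. This is a small sketch, assuming the platform's `iconv()` supports both `UTF-16BE` and `UTF-32BE`; U+1F600 is just an example code point.

```r
x <- "\U0001F600"  # a code point above U+FFFF

# UTF-16 needs a surrogate pair: two 16-bit units, neither is the code point
utf16 <- iconv(x, "UTF-8", "UTF-16BE", toRaw = TRUE)[[1]]

# UTF-32 stores the code point itself in four bytes
utf32 <- iconv(x, "UTF-8", "UTF-32BE", toRaw = TRUE)[[1]]

print(utf16)  # d8 3d de 00
print(utf32)  # 00 01 f6 00
```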

Based on the trial and error above, here is a probably safer escape function:

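unicode_escape <- function(x, endian = "big") {
  # re-encode only strings that are not already marked as UTF-8
  if (Encoding(x) != "UTF-8") {
    x <- enc2utf8(enc2native(x))
  }
  to.enc <- if (endian == "big") "UTF-32BE" else "UTF-32LE"

  # UTF-32 stores every code point in exactly four bytes
  bytes <- as.integer(iconv(x, "UTF-8", to.enc, toRaw = TRUE)[[1]])
  runes <- matrix(bytes, nrow = 4)
  # put the most significant byte first regardless of endianness
  if (endian != "big") runes <- runes[4:1, , drop = FALSE]
  # combine the four bytes of each column into one code point
  codes <- colSums(runes * 256^(3:0))

  escaped <- vapply(codes, function(cp) {
    if (cp < 128) {
      rawToChar(as.raw(cp))               # ASCII: keep as is
    } else if (cp <= 0xFFFF) {
      sprintf("\\u%04x", as.integer(cp))  # BMP: \uxxxx
    } else {
      sprintf("\\U%08x", as.integer(cp))  # beyond the BMP: \Uxxxxxxxx
    }
  }, character(1))
  paste(escaped, collapse = "")
}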

Tests:

> unicode_escape("•••ERROR!!!•••")
[1] "\\u2022\\u2022\\u2022ERROR!!!\\u2022\\u2022\\u2022"
> unicode_escape("Hello word! 你好世界！")
[1] "Hello word! \\u4f60\\u597d\\u4e16\\u754c\\uff01"
> "\u4f60\u597d\u4e16\u754c"
[1] "你好世界"
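A round-trip check is a useful complement to these tests: parse the escaped text back into a string and compare it with the input. The sketch below does this with a compact escaper built on base R's `utf8ToInt()` rather than `iconv()`. `escape_non_ascii` and `round_trips` are hypothetical names, and the check assumes a single input string that contains no quotes or backslashes.

```r
escape_non_ascii <- function(x) {
  cps <- utf8ToInt(enc2utf8(x))  # one integer code point per character
  out <- ifelse(
    cps < 128L,
    intToUtf8(cps, multiple = TRUE),  # ASCII stays as it is
    ifelse(cps <= 0xFFFF,
           sprintf("\\u%04x", cps),   # BMP
           sprintf("\\U%08x", cps))   # beyond the BMP
  )
  paste(out, collapse = "")
}

round_trips <- function(x) {
  escaped <- escape_non_ascii(x)
  # wrap the escaped text in quotes and let the R parser decode it
  parsed <- eval(parse(text = paste0('"', escaped, '"')))
  identical(parsed, x)
}

round_trips("Hello \u4f60\u597d")
```

Because the parser does the decoding, a `TRUE` result means the escaped form can be pasted into ASCII-only source code without changing the string.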

answered Nov 05 '22 09:11

Xin Yin

The package stringi has a method for doing this

stri_escape_unicode(y)
# [1] "\\u2022"

answered Nov 05 '22 10:11

konvas
