What's the preferred way in R to convert a character (vector) containing non-ASCII characters to HTML? I would for example like to convert
"ü"
to
"&uuml;"
I am aware that this is possible by a clever use of gsub (but has anyone done it once and for all?), and I thought that the package R2HTML would do that, but it doesn't.
EDIT: Here is what I ended up using; it can obviously be extended by modifying the dictionary:
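char2html <- function(x){
  # Lookup table mapping each symbol to its named HTML entity
  dictionary <- data.frame(
    symbol = c("ä","ö","ü","Ä", "Ö", "Ü", "ß"),
    html = c("&auml;","&ouml;", "&uuml;","&Auml;",
             "&Ouml;", "&Uuml;","&szlig;"),
    stringsAsFactors = FALSE)
  for(i in seq_len(nrow(dictionary))){
    x <- gsub(dictionary$symbol[i],dictionary$html[i],x, fixed = TRUE)
  }
  x
}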
x <- c("Buschwindröschen", "Weißdorn")
char2html(x)
This question is pretty old, but I couldn't find any straightforward answer, so I came up with this simple function, which uses numeric HTML codes and works for the Latin-1 Supplement block (integer values 161 to 255). There's probably (certainly?) a function in some package that does it more thoroughly, but what follows is probably good enough for many applications.
conv_latinsupp <- function(...) {
  out <- character()
  for (s in list(...)) {
    # Split into single characters and get the matching code points
    splitted <- unlist(strsplit(s, ""))
    intvalues <- utf8ToInt(enc2utf8(s))
    # Replace Latin-1 Supplement characters with numeric entities
    pos_to_modify <- which(intvalues >= 161 & intvalues <= 255)
    splitted[pos_to_modify] <- paste0("&#", intvalues[pos_to_modify], ";")
    out <- c(out, paste0(splitted, collapse = ""))
  }
  out
}
conv_latinsupp("aeiou", "àéïôù12345")
## [1] "aeiou" "&#224;&#233;&#239;&#244;&#249;12345"
The XML package uses a function, insertEntities, for this, but that function is internal. So you may use it at your own risk, as there are no guarantees that it will keep working like this in future versions.
Right now, your code could be accomplished using
char2html <- function(x) XML:::insertEntities(x, c("ä"="auml", "ö"="ouml", …))
Using a named vector instead of a data.frame feels kind of elegant, but doesn't change the core of things. Under the hood, insertEntities calls gsub in much the same way your code does.
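If you would rather not depend on an internal function, the same named-vector idea can be sketched with gsub directly. This is only an illustration, not the XML package's source: the function name char2html_named and the entities vector are made up for this example, and the vector covers only the German characters from the question.

```r
# Hypothetical lookup: names are the symbols, values are the entity names
# (without the leading "&" and trailing ";").
entities <- c("ä" = "auml", "ö" = "ouml", "ü" = "uuml",
              "Ä" = "Auml", "Ö" = "Ouml", "Ü" = "Uuml", "ß" = "szlig")

# Hypothetical helper: replace each symbol with its named HTML entity.
char2html_named <- function(x, entities) {
  for (sym in names(entities)) {
    # fixed = TRUE treats the symbol as a literal string, not a regex
    x <- gsub(sym, paste0("&", entities[[sym]], ";"), x, fixed = TRUE)
  }
  x
}

char2html_named(c("Buschwindröschen", "Weißdorn"), entities)
```

Extending the mapping is then a matter of adding more name/value pairs to the vector.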
If numeric HTML entities are valid in your environment, then you could probably convert all your text into those using utf8ToInt and then turn safely printable ASCII characters back into unescaped form. This would save you the trouble of maintaining a dictionary for your entities.
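A minimal sketch of that idea follows. Instead of escaping everything and then un-escaping ASCII, it leaves code points up to 127 alone and escapes the rest, which should give the same result. The name char2numeric_html is made up for this example; the sketch assumes the input contains no NA values, and it does not escape HTML-special ASCII characters such as & or <.

```r
# Hypothetical helper: turn every non-ASCII code point into a numeric entity.
char2numeric_html <- function(x) {
  vapply(enc2utf8(x), function(s) {
    codes <- utf8ToInt(s)                       # code points of one string
    chars <- intToUtf8(codes, multiple = TRUE)  # one element per character
    non_ascii <- codes > 127
    chars[non_ascii] <- paste0("&#", codes[non_ascii], ";")
    paste0(chars, collapse = "")
  }, character(1), USE.NAMES = FALSE)
}

char2numeric_html(c("Buschwindröschen", "Weißdorn"))
## [1] "Buschwindr&#246;schen" "Wei&#223;dorn"
```

Because the entity is built from the code point itself, no dictionary needs to be maintained.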