
Convert character to html in R

Tags:

html

r

What's the preferred way in R to convert a character vector containing non-ASCII characters to HTML? For example, I would like to convert

  "ü"

to

  "&uuml;"

I am aware that this is possible with a clever use of gsub (but has anyone done it once and for all?), and I thought the R2HTML package would do it, but it doesn't.

EDIT: Here is what I ended up using; it can obviously be extended by modifying the dictionary:

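char2html <- function(x){
  # Lookup table: each character and the HTML entity that replaces it
  dictionary <- data.frame(
    symbol = c("ä","ö","ü","Ä", "Ö", "Ü", "ß"),
    html = c("&auml;","&ouml;", "&uuml;","&Auml;",
             "&Ouml;", "&Uuml;","&szlig;"),
    stringsAsFactors = FALSE)
  for(i in seq_len(nrow(dictionary))){
    # fixed = TRUE: match the symbol literally, not as a regular expression
    x <- gsub(dictionary$symbol[i], dictionary$html[i], x, fixed = TRUE)
  }
  x
}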

x <- c("Buschwindröschen", "Weißdorn")
char2html(x)
Philipp asked Oct 22 '12 19:10


2 Answers

This question is pretty old, but I couldn't find a straightforward answer, so I came up with this simple function. It uses numeric HTML codes and works for the Latin-1 Supplement block (code points 161 to 255). There is probably a function in some package that does this more thoroughly, but what follows is probably good enough for many applications.

conv_latinsupp <- function(...) {
  out <- character()
  for (s in list(...)) {
    s <- enc2utf8(s)                      # ensure UTF-8 before splitting
    splitted <- unlist(strsplit(s, ""))   # one element per character
    intvalues <- utf8ToInt(s)             # matching code points
    # Latin-1 Supplement: code points 161 to 255
    pos_to_modify <- which(intvalues >= 161 & intvalues <= 255)
    splitted[pos_to_modify] <- paste0("&#0", intvalues[pos_to_modify], ";")
    out <- c(out, paste0(splitted, collapse = ""))
  }
  out
}

conv_latinsupp("aeiou", "àéïôù12345")
## [1] "aeiou"   "&#0224;&#0233;&#0239;&#0244;&#0249;12345"
Dominic Comtois answered Nov 01 '22 21:11


The XML package has a function, insertEntities, for this, but it is internal (not exported). Use it at your own risk: there is no guarantee that it will keep working this way in future versions.

Right now, your function could be written as

char2html <- function(x) XML:::insertEntities(x, c("ä"="auml", "ö"="ouml", …))

Using a named character vector instead of a data.frame feels more elegant, but doesn't change the core of things. Under the hood, insertEntities calls gsub in much the same way your code does.

If numeric HTML entities are valid in your environment, then you could probably convert all your text into those using utf8ToInt and then turn safely printable ASCII characters back into unescaped form. This would save you the trouble of maintaining a dictionary for your entities.
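
The numeric-entity idea could be sketched roughly as follows. This is only a minimal sketch, not code from any package: the name char2numeric_entities is made up for illustration, and it assumes that decimal entities such as &#246; are acceptable in your output. It uses only base R (enc2utf8, utf8ToInt, intToUtf8) and escapes every code point above 127, leaving ASCII as it is.

```r
# Hypothetical helper (not part of any package): replace every non-ASCII
# character with a decimal numeric HTML entity, leave ASCII untouched.
char2numeric_entities <- function(x) {
  vapply(enc2utf8(x), function(s) {
    if (is.na(s)) return(NA_character_)
    cp <- utf8ToInt(s)                        # code points of one string
    chars <- intToUtf8(cp, multiple = TRUE)   # one element per character
    non_ascii <- cp > 127
    if (any(non_ascii)) {
      chars[non_ascii] <- paste0("&#", cp[non_ascii], ";")
    }
    paste(chars, collapse = "")
  }, character(1), USE.NAMES = FALSE)
}

# \u00f6 is ö and \u00df is ß; the escapes keep the snippet independent
# of the encoding of the source file.
char2numeric_entities(c("Buschwindr\u00f6schen", "Wei\u00dfdorn"))
## [1] "Buschwindr&#246;schen" "Wei&#223;dorn"
```

Note that this sketch does not touch HTML-special ASCII characters such as &, < and >; if the input may contain those, they would need to be escaped separately.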

MvG answered Nov 01 '22 20:11