
Unescape HTML &#nn; sequences

Tags:

r

My text has some HTML-escaped characters; for instance, instead of ' there is &#39;. Now I would like to unescape these sequences. Since I do not know in advance which characters are escaped, I do not want to use a simple mapping such as c("&#39;"="'", ...).

I understand that the number after the &# is the decimal Unicode code point. So &#39; is \u27, since 27 is the hexadecimal representation of 39. So I thought of a solution along the lines of

sprintf("\u%x", s)

where s is the number extracted from between &# and ;. However, this results in an error: "\u used without hex numbers."

What would be a better approach to convert HTML escaped sequences back to characters?

Karsten W. asked Mar 14 '26 07:03


1 Answer

Just for reference, here is the solution I came up with. It makes use of the great package gsubfn:

library(gsubfn)

I use a vector htmlchars for named html entities I scraped from Wikipedia. For brevity, I do not copy the vector in this answer here, but source it from pastebin:

source("http://pastebin.com/raw.php?i=XtzN1NMs") # creates variable htmlchars

Now the decoding function I was looking for is simply:

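strdehtml <- function(s) {
    # Numeric references such as &#39;: convert the decimal code point to a character.
    # intToUtf8() also handles code points above 255, which rawToChar(as.raw()) cannot.
    ret <- gsubfn("&#([0-9]+);", function(x) intToUtf8(as.integer(x)), s)
    # Named entities such as &amp;: look the name up in htmlchars.
    ret <- gsubfn("&([^;]+);", function(x) htmlchars[x], ret)
    return(ret)
}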

I am not sure this covers every possible HTML entity (hexadecimal references such as &#x27; are not handled, for one), but it works for my purposes. For instance, it can be used like this:

test <- "My this &amp; last year&#39;s resolutions"
strdehtml(test)
[1] "My this & last year's resolutions"
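As a side note, the sprintf("\u%x", s) attempt from the question fails because R resolves \u escapes when it parses the string literal, before sprintf ever runs. Base R's intToUtf8() builds the character from the code point at run time instead. Below is a minimal sketch of a base-R-only variant that also accepts hexadecimal references. The function name decode_numeric_refs is made up for illustration, and the sketch assumes that only numeric references need decoding and that every code point is valid.

```r
# Decode numeric character references (decimal "&#39;" and hex "&#x27;")
# using base R only. Named entities such as "&amp;" are left untouched.
# Assumption: all referenced code points are valid; no range checking is done.
decode_numeric_refs <- function(s) {
  pattern <- "&#([0-9]+|[xX][0-9A-Fa-f]+);"
  m <- gregexpr(pattern, s, perl = TRUE)
  regmatches(s, m) <- lapply(regmatches(s, m), function(refs) {
    vapply(refs, function(ref) {
      body <- substr(ref, 3, nchar(ref) - 1)  # strip leading "&#" and trailing ";"
      code <- if (grepl("^[xX]", body)) {
        strtoi(substring(body, 2), base = 16L)  # hexadecimal reference
      } else {
        strtoi(body, base = 10L)                # decimal reference
      }
      intToUtf8(code)
    }, character(1), USE.NAMES = FALSE)
  })
  s
}

test <- "My this &amp; last year&#39;s resolutions"
print(decode_numeric_refs(test))
```

Here the apostrophe is decoded, while named entities such as &amp; are left alone and would still need a lookup table like htmlchars.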
Karsten W. answered Mar 16 '26 22:03