My text has some HTML-escaped characters; for instance, instead of ' there is &#39;. Now I would like to unescape these sequences. Since I do not know which characters are escaped, I do not want to use a simple mapping such as c("&#39;"="'", ...).
I understand that the number after the ampersand and hash sign is the decimal Unicode code point. So &#39; is \u27, since 27 is the hexadecimal representation of 39. So I thought of a solution that involves
sprintf("\u%x", s)
where s is the number extracted from between &# and ;. However, this results in an error: "\u used without hex numbers."
What would be a better approach to convert HTML escaped sequences back to characters?
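The error most likely arises because R's parser resolves \u escapes when it reads the string literal, so "\u%x" is rejected before sprintf ever runs. A minimal sketch of a run-time alternative, assuming s holds the extracted digits as a string, uses base R's intToUtf8:

```r
# Convert a decimal code point to its character at run time.
# Assumption: s is the digit string taken from between "&#" and ";".
s <- "39"
ch <- intToUtf8(as.integer(s))  # look up the character for code point 39
print(ch)
```

Because the conversion happens at run time, no \u escape has to appear in the source code at all.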
Just for reference, here is the solution I came up with. It makes use of the great package gsubfn:
library(gsubfn)
I use a vector htmlchars for named html entities I scraped from Wikipedia. For brevity, I do not copy the vector in this answer here, but source it from pastebin:
source("http://pastebin.com/raw.php?i=XtzN1NMs") # creates variable htmlchars
Now the decoding function I was looking for is simply:
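strdehtml <- function(s) {
  # Numeric references such as &#39;: intToUtf8 also handles code points above 255,
  # where rawToChar(as.raw(...)) would fail
  ret <- gsubfn("&#([0-9]+);", function(x) intToUtf8(as.integer(x)), s)
  # Named references such as &amp;: look the name up in htmlchars
  ret <- gsubfn("&([^;]+);", function(x) htmlchars[x], ret)
  return(ret)
}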
Not sure if this covers all possible HTML characters, but it gets me working. For instance, it can be used thus:
test <- "My this &amp; last year&#39;s resolutions"
strdehtml(test)
[1] "My this & last year's resolutions"
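The numeric pattern above matches decimal references only. As a hedged sketch that needs no extra package and also handles hexadecimal references such as &#x27;, something like the following could be used. The helper name decode_numeric_entities is hypothetical and not part of the original solution, and named entities such as &amp; are deliberately left untouched:

```r
# Hypothetical helper: decode numeric character references only.
decode_numeric_entities <- function(s) {
  # Locate hexadecimal (&#x27;) and decimal (&#39;) references.
  m <- gregexpr("&#[xX][0-9A-Fa-f]+;|&#[0-9]+;", s)
  regmatches(s, m) <- lapply(regmatches(s, m), function(ents) {
    vapply(ents, function(e) {
      body <- substr(e, 3, nchar(e) - 1)  # strip the leading "&#" and trailing ";"
      code <- if (grepl("^[xX]", body)) {
        strtoi(substring(body, 2), base = 16L)  # hexadecimal form
      } else {
        strtoi(body, base = 10L)                # decimal form
      }
      intToUtf8(code)
    }, character(1), USE.NAMES = FALSE)
  })
  s
}

decode_numeric_entities("last year&#39;s and this year&#x27;s")
```

The function is vectorised over s, since gregexpr and regmatches work element by element.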