I'm trying to read tables into R from HTML pages that are mostly encoded in UTF-8 (and declare <meta charset="utf-8">) but have some strings in some other encodings (I think Windows-1252 or ISO 8859-1). Here's an example. I want everything decoded properly into an R data frame. XML::readHTMLTable takes an encoding argument but doesn't seem to allow one to try multiple encodings.
So, in R, how can I try several encodings for each line of the input file? In Python 3, I'd do something like:
with open('file', 'rb') as o:
    for line in o:
        try:
            line = line.decode('UTF-8')
        except UnicodeDecodeError:
            line = line.decode('Windows-1252')
There do seem to be R library functions for guessing character encodings, like stringi::stri_enc_detect, but when possible, it's probably better to use the simpler deterministic method of trying a fixed set of encodings in order. It looks like the best way to do this is to take advantage of the fact that when iconv fails to convert a string, it returns NA.
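A quick illustration of both behaviors (the byte \xfc is ü in Windows-1252 but is not valid UTF-8):

x = "M\xfcller"                    # Windows-1252 bytes for "Müller"
iconv(x, "UTF-8", "UTF-8")         # NA, because 0xfc isn't valid UTF-8
iconv(x, "Windows-1252", "UTF-8")  # "Müller"
stringi::stri_enc_detect(x)        # a ranked list of guesses, not a guarantee

Putting the iconv trick to work line by line: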
linewise.decode = function(path)
    sapply(readLines(path), USE.NAMES = FALSE, function(line) {
        # Already valid UTF-8 (including plain ASCII)? Leave it alone.
        if (validUTF8(line))
            return(line)
        # Try Windows-1252. This can still fail, since a few byte
        # values (such as 0x81) are undefined in Windows-1252.
        l2 = iconv(line, "Windows-1252", "UTF-8")
        if (!is.na(l2))
            return(l2)
        # Fall back to Shift-JIS.
        l2 = iconv(line, "Shift-JIS", "UTF-8")
        if (!is.na(l2))
            return(l2)
        stop("Encoding not detected")
    })
If you create a test file with
$ python3 -c 'with open("inptest", "wb") as o: o.write(b"This line is ASCII\n" + "This line is UTF-8: I like π\n".encode("UTF-8") + "This line is Windows-1252: Müller\n".encode("Windows-1252") + "This line is Shift-JIS: ハローワールド\n".encode("Shift-JIS"))'
then linewise.decode("inptest") indeed returns
[1] "This line is ASCII"
[2] "This line is UTF-8: I like π"
[3] "This line is Windows-1252: Müller"
[4] "This line is Shift-JIS: ハローワールド"
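If you'd rather build the test file without leaving R, something like this should write the same bytes (a sketch: it assumes your session's native encoding is UTF-8, so the literals below start out as UTF-8 before being re-encoded):

con = file("inptest", "wb")
for (l in c(
        "This line is ASCII\n",
        "This line is UTF-8: I like π\n",
        iconv("This line is Windows-1252: Müller\n", "UTF-8", "Windows-1252"),
        iconv("This line is Shift-JIS: ハローワールド\n", "UTF-8", "Shift-JIS")))
    writeBin(charToRaw(l), con)
close(con)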
To use linewise.decode with XML::readHTMLTable, just say something like XML::readHTMLTable(linewise.decode("http://example.com")).
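Spelled out (the paste is my own defensive addition, in case readHTMLTable treats a multi-element character vector differently from a single HTML string):

# Decode the page first, then hand the cleaned-up HTML to readHTMLTable.
html = paste(linewise.decode("http://example.com"), collapse = "\n")
tables = XML::readHTMLTable(html)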