 

Forcefully set Encoding from unknown to UTF-8 or any encoding in R?

Tags: r, encoding, iconv

I am reading data from an old proprietary database. For some strings, Encoding(mychar_vector) returns "unknown". I am using a wrapper around a closed-source C HLI (host language interface), so there's probably not much I can do about that. If there is, I'd be glad to be proven wrong.

However, apart from a few replacements I had to make using gsub (see my related question), the strings in the vector look OK. I would love to regain control of the encoding. Is there a way to forcefully set the encoding to UTF-8? I tried:

Encoding(mychar_vector) <- "UTF-8"
# or
mychar_vector <- enc2utf8(mychar_vector)

But neither worked: checking immediately afterwards still returned "unknown". I also looked into iconv, but there is apparently no way to convert from "unknown" to UTF-8, as there is no mapping.

Is there a way to tell R that only UTF-8 characters are involved, so that the encoding can be set to UTF-8? Note that some elements of the vector are already UTF-8.

Matt Bannert asked Mar 20 '13 21:03



2 Answers

I, too, have been down the encoding rabbit hole, and one of the important things I learned is that "unknown" encoding doesn't have to mean it's not UTF-8. Or bad. Or something that needs to be fixed.

Here are some examples:

# Some string that might be UTF-8 or just some ASCII (but created in UTF-8 editor/environment)
ambiguous <- "wat"
Encoding(ambiguous)
#> [1] "unknown"

# Forced coercion to UTF-8 via stringi
ambiguous <- stringi::stri_enc_toutf8("wat", is_unknown_8bit = TRUE)

# Still ambiguous
Encoding(ambiguous)
#> [1] "unknown"

# Some pretty-sure-not-ASCII string
totallygermanic <- "wät"

# It's UTF-8 because that's what my RStudio and every other part of my env is set to
Encoding(totallygermanic)
#> [1] "UTF-8"

# Let's force it to be unknown
Encoding(totallygermanic) <- "unknown"

# Still prints ok
totallygermanic
#> [1] "wät"

# What's its encoding now?
Encoding(totallygermanic)
#> [1] "unknown"

# Converting it to UTF-8 still prints ok
stringi::stri_enc_toutf8(totallygermanic)
#> [1] "wät"

# So the converted string is UTF-8, right? No.
Encoding(stringi::stri_enc_toutf8(totallygermanic))
#> [1] "unknown"

# Maybe we should just guess?
stringi::stri_enc_detect("wat")
#> [[1]]
#>     Encoding Language Confidence
#> 1 ISO-8859-1       en       0.75
#> 2 ISO-8859-2       ro       0.75
#> 3      UTF-8                0.15

stringi::stri_enc_detect("wät")
#> [[1]]
#>   Encoding Language Confidence
#> 1    UTF-8                 0.8
#> 2 UTF-16BE                 0.1
#> 3 UTF-16LE                 0.1
#> 4  GB18030       zh        0.1
#> 5   EUC-JP       ja        0.1
#> 6   EUC-KR       ko        0.1
#> 7     Big5       zh        0.1

Created on 2019-02-11 by the reprex package (v0.2.1)

The takeaway is this: if your string contains only ASCII characters, e.g. only the letters a-z, it could be ASCII or it could be UTF-8, so you get "unknown"; that does not have to mean your string is not actually UTF-8. You may try to forcibly coerce the string, but in the process you might break something that was not broken at all. In my experience, it may be perfectly adequate to use a conversion function like stringi::stri_enc_toutf8 on a variable/vector and test whether it prints/works as expected, maybe using a regular expression filter for possibly problematic characters (as a native German speaker, I tend to look for äöüß).
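As a minimal base-R sketch of that advice (not part of the original reprex): check the bytes first, mark only the elements that are valid UTF-8 and contain non-ASCII bytes, then filter for the German characters. The vector x and the helper names is_ascii, is_valid, to_mark and has_umlaut are made up for illustration, and the sketch assumes the bytes really are UTF-8.

```r
# Hypothetical input built from raw bytes, so the example does not depend on
# how this script itself is encoded: "wat" (ASCII) and "wät" (UTF-8 bytes).
x <- c("wat", rawToChar(as.raw(c(0x77, 0xc3, 0xa4, 0x74))))
Encoding(x) <- "unknown"   # simulate strings arriving without a declared encoding

# TRUE where an element consists of 7-bit ASCII bytes only.
is_ascii <- vapply(
  x,
  function(s) all(as.integer(charToRaw(s)) < 128L),
  logical(1),
  USE.NAMES = FALSE
)

# validUTF8() (base R >= 3.3.0) checks the bytes, not the declared encoding.
is_valid <- validUTF8(x)

# Mark only elements that are valid UTF-8 and actually contain non-ASCII bytes.
to_mark <- is_valid & !is_ascii
Encoding(x[to_mark]) <- "UTF-8"

# The ASCII-only element stays "unknown"; R never marks pure ASCII strings.
Encoding(x)
#> [1] "unknown" "UTF-8"

# Regular-expression filter for the German characters mentioned above
# (written as \u escapes for ä, ö, ü, ß).
has_umlaut <- grepl("[\u00e4\u00f6\u00fc\u00df]", x)
```

Elements that fail validUTF8() are left untouched here; they probably need an explicit conversion from their real encoding rather than a relabel.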

Anyway, if you want to dive into the nitty-gritty, I can recommend looking into the stringi package and its encoding functions. This package is the power behind stringr, which provides a more high-level interface.

Jemus42 answered Oct 03 '22 22:10



When I have dealt with files that are not UTF-8 encoded properly, I have used iconv with great success to forcefully convert the file by simply running a bash script in my rmarkdown notebook:

iconv -c -t UTF-8 myfile.txt > Ratebeer-myfile.txt

You could also try this, where file.txt is your original file and file-iconv.txt is the converted file:

iconv -f ISO-8859-1 -t UTF-8 file.txt > file-iconv.txt

Verify the encoding with (file -I is the macOS/BSD spelling; GNU file on Linux uses file -i):

file -I file-iconv.txt
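If the strings are already in R rather than in a file, roughly the same conversions can be sketched with R's own iconv(). This assumes the bytes are ISO-8859-1, which is only a guess about the data; the names latin1_bytes, converted and cleaned are made up for illustration.

```r
# Bytes for "wät" in ISO-8859-1 (0xe4 is "ä"), built from raw bytes so the
# example does not depend on how this script itself is saved.
latin1_bytes <- rawToChar(as.raw(c(0x77, 0xe4, 0x74)))

# Counterpart of `iconv -f ISO-8859-1 -t UTF-8`: state the source encoding
# explicitly instead of relying on the "unknown" mark.
converted <- iconv(latin1_bytes, from = "ISO-8859-1", to = "UTF-8")

Encoding(converted)
#> [1] "UTF-8"
validUTF8(converted)
#> [1] TRUE

# Rough counterpart of `iconv -c -t UTF-8`: treat the input as UTF-8 and
# replace bytes that cannot be converted with "" (i.e. drop them). The exact
# result can depend on the platform's iconv implementation.
cleaned <- iconv(latin1_bytes, from = "UTF-8", to = "UTF-8", sub = "")
```

Dropping bytes loses data, so the explicit from = conversion is usually the safer choice when the source encoding can be determined.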

Let me know if this helps or not.

petergensler answered Oct 03 '22 22:10
