I am reading data from an old proprietary database. Unfortunately, for some strings, Encoding(mychar_vector) returns "unknown". I am using a wrapper around a closed-source C HLI (host language interface), so there's probably not much I can do about that – if so, I am glad to be proven wrong here...
However, looking at the string vector, the strings look OK, except for a few replacements I had to make using gsub (see my related question). I would love to regain control of the encoding. Is there a way to forcefully set the encoding to UTF-8? I tried
Encoding(mychar_vector) <- "UTF-8"
# or
mychar_vector <- enc2utf8(mychar_vector)
but none of this worked out – checking immediately afterwards just returned "unknown" again. I also looked into iconv, but there is obviously no way to convert from "unknown" to UTF-8, as there is no such mapping.
Is there a way to tell R that only UTF-8 characters are involved, so that the encoding can be set to UTF-8? Note that some of the elements of the vector are already UTF-8.
I, too, have been down the encoding rabbit hole, and one of the important things I learned is that an "unknown" encoding doesn't have to mean it's not UTF-8. Or bad. Or something that needs to be fixed.
Here are some examples:
# Some string that might be UTF-8 or just some ASCII (but created in UTF-8 editor/environment)
ambiguous <- "wat"
Encoding(ambiguous)
#> [1] "unknown"
# Forced coercion to UTF-8 via stringi
ambiguous <- stringi::stri_enc_toutf8("wat", is_unknown_8bit = TRUE)
# Still ambiguous
Encoding(ambiguous)
#> [1] "unknown"
# Some pretty-sure-not-ASCII string
totallygermanic <- "wät"
# It's UTF-8 because that's what my RStudio and every other part of my env is set to
Encoding(totallygermanic)
#> [1] "UTF-8"
# Let's force it to be unknown
Encoding(totallygermanic) <- "unknown"
# Still prints ok
totallygermanic
#> [1] "wät"
# What's its encoding now?
Encoding(totallygermanic)
#> [1] "unknown"
# Converting it to UTF-8 still prints ok
stringi::stri_enc_toutf8(totallygermanic)
#> [1] "wät"
# So the converted string is UTF-8, right? No.
Encoding(stringi::stri_enc_toutf8(totallygermanic))
#> [1] "unknown"
# Maybe we should just guess?
stringi::stri_enc_detect("wat")
#> [[1]]
#> Encoding Language Confidence
#> 1 ISO-8859-1 en 0.75
#> 2 ISO-8859-2 ro 0.75
#> 3 UTF-8 0.15
stringi::stri_enc_detect("wät")
#> [[1]]
#> Encoding Language Confidence
#> 1 UTF-8 0.8
#> 2 UTF-16BE 0.1
#> 3 UTF-16LE 0.1
#> 4 GB18030 zh 0.1
#> 5 EUC-JP ja 0.1
#> 6 EUC-KR ko 0.1
#> 7 Big5 zh 0.1
Created on 2019-02-11 by the reprex package (v0.2.1)
The takeaway is this: if your string is not obviously non-ASCII, e.g. it only contains the letters a-z, it could be ASCII or it could be UTF-8, so you get "unknown" – but that doesn't have to mean your string is not actually UTF-8. You may try to forcibly coerce the string, but in the process you might break something that was not broken at all. In my experience, it can be perfectly adequate to use a conversion function like stringi::stri_enc_toutf8 on a variable/vector and test whether it prints/works as expected, maybe using a regular expression filter for possibly problematic characters first (as German natives we tend to look for äöüß).
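As a sketch of that filter-then-convert approach (the sample strings and variable names here are made up for illustration):

```r
library(stringi)

# Hypothetical vector with mixed content
x <- c("plain ascii", "w\u00e4t", "gr\u00fc\u00df gott")

# Filter for possibly problematic characters first
grepl("[äöüßÄÖÜ]", x)

# Force everything to UTF-8; bytes in an unknown 8-bit encoding
# are mapped instead of being dropped
x_utf8 <- stri_enc_toutf8(x, is_unknown_8bit = TRUE)

# Check that every element is now valid UTF-8
all(stri_enc_isutf8(x_utf8))
```

If the last check returns TRUE and the vector still prints as expected, the coercion most likely did not break anything.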
Anyway, if you want to dive into the nitty-gritty, I can recommend looking into the stringi package and its encoding functions. This package is the power behind stringr, which provides a more high-level interface.
When I have dealt with files that were not properly UTF-8 encoded, I have used iconv with great success to forcefully convert the file, simply by running a bash script in my R Markdown notebook:
iconv -c -t UTF-8 myfile.txt > Ratebeer-myfile.txt
You could also try the following, where file.txt is your original file and file-iconv.txt is the converted file:
iconv -f iso-8859-1 -t UTF-8 file.txt > file-iconv.txt
Verify the encoding with:
file -I file-iconv.txt
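As a self-contained sketch of that conversion (the file names and sample content are made up; the bytes are written as octal escapes so the script does not depend on the editor's encoding):

```shell
# Create a small ISO-8859-1 encoded sample file
# (octal \344 = a-umlaut, \374 = u-umlaut in Latin-1)
printf 'w\344t gr\374n\n' > file.txt

# Convert it to UTF-8, as in the command above
iconv -f iso-8859-1 -t UTF-8 file.txt > file-iconv.txt

# file-iconv.txt now holds the same text as valid UTF-8;
# file -I file-iconv.txt should report charset=utf-8
cat file-iconv.txt
```

Note that -I is the macOS spelling; on GNU file the equivalent is -i or --mime-encoding.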
Let me know if this helps or not.