I have a very large dataset (70k rows, 2600 columns, CSV format) that I created by web scraping. Unfortunately, at some point during pre-processing and processing, some problematic characters became encoded in an odd way, and I have trouble dealing with them.
I have strings like the following:
x = "but it doesn<U+0092>t matter"
Looking up the code, we can see that it should be the character ’, which should really be ' (the data are user-generated, so they may contain all kinds of odd characters). From looking up that character, it seems that others also have problems with it (1, 2, 3). It's labelled a control character; I'm not sure what that means, but perhaps that's why it's so hard to deal with.
Most of the other questions about Unicode in R concern Unicode in a format like this: \u0092.
Encoding()
Let's try:
#> x = "but it doesn<U+0092>t matter"
#> Encoding(x)
#[1] "unknown"
#> Encoding(x) = "UTF-8"
#> Encoding(x)
#[1] "unknown"
#> x
#[1] "but it doesn<U+0092>t matter"
So this does not seem to do anything.
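One possible explanation, assuming x really holds the eight literal characters <U+0092> as typed above (and not a single U+0092 character that the console merely prints that way): the string is then pure ASCII, and R never marks ASCII-only strings with a declared encoding. A small check along those lines:

```r
x <- "but it doesn<U+0092>t matter"

# Every character is plain ASCII (code point below 128) ...
all(utf8ToInt(x) < 128L)

# ... so the declared encoding stays "unknown" even after the assignment.
Encoding(x) <- "UTF-8"
Encoding(x)

# The token is eight separate characters, not one.
nchar("<U+0092>")
```

If the first check comes back FALSE on the real data, the strings contain genuine non-ASCII characters and this explanation does not apply.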
There are a few prior questions that concern this Unicode format and try to convert it:
Oddly, the examples they give work, but mine doesn't.
#> test.string <- "This is a <U+03B1> <U+03B2> <U+03B2> <U+03B3> test <U+03B4> string."
#> Encoding(test.string)
#[1] "unknown"
#> to_true_unicode(test.string)
#[1] "This is a α β β γ test δ string."
But:
#> x2 = to_true_unicode(x)
#> x2
#[1] "but it doesn\u0092t matter"
#> cat(x2)
#but it doesnt matter
#> Encoding(x2)
#[1] "UTF-8"
So it managed to convert from the <U+....> format to the \u format, and cat() prints the string without that character (or with a bugged symbol on SO).
I only have a limited number of these problems, so I could perhaps just use search-replace to solve it. However:
#> #base-r
#> gsub(x = x, pattern = "<U+0092>", replacement = "'")
#[1] "but it doesn<U+0092>t matter"
#> #stringr/stringi
#> library(stringr)
#> str_replace(x, pattern = "<U+0092>", "'")
#[1] "but it doesn<U+0092>t matter"
So replacement does not seem to work, but it does work on the \u versions:
#> #base-r
#> gsub(x = x2, pattern = "\u0092", replacement = "'")
#[1] "but it doesn't matter"
#> #stringr/stringi
#> library(stringr)
#> str_replace(x2, pattern = "\u0092", "'")
#[1] "but it doesn't matter"
So this suggests a working method: 1) convert the <U+> format to the \u format, then 2) use search-replace.
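The two-step method can be sketched in base R as follows. This is only a sketch: the helper name unescape_angle_unicode is made up for illustration, and it assumes the strings contain the literal characters <U+XXXX>. It also shows one likely reason the direct replacement above returned the string unchanged: in a regular expression, + is a quantifier, so the pattern "<U+0092>" looks for one or more U followed by 0092, not for a literal plus sign. Passing fixed = TRUE (or escaping it as \\+) avoids that.

```r
x <- "but it doesn<U+0092>t matter"

# Direct replacement: treat the pattern as a fixed string, not a regex.
direct <- gsub("<U+0092>", "'", x, fixed = TRUE)

# Hypothetical helper: turn every literal "<U+XXXX>" token into the
# character with that code point.
unescape_angle_unicode <- function(s) {
  m <- gregexpr("<U\\+[0-9A-Fa-f]{4,6}>", s)
  regmatches(s, m) <- lapply(regmatches(s, m), function(tokens) {
    # Strip "<", "U", "+", ">" and parse what is left as hexadecimal.
    codepoints <- strtoi(gsub("[<U+>]", "", tokens), base = 16L)
    vapply(codepoints, intToUtf8, character(1))
  })
  s
}

# Step 1: convert the tokens; step 2: replace the unwanted character.
two_step <- gsub("\u0092", "'", unescape_angle_unicode(x), fixed = TRUE)

direct
two_step
```

The helper is vectorised over s, so it could be applied to a whole character column at once; strings without any token are returned unchanged.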
stringi::stri_unescape_unicode()
This does not seem to work with either version:
#> stringi::stri_unescape_unicode(x)
#[1] "but it doesn<U+0092>t matter"
#> stringi::stri_unescape_unicode(x2)
#[1] "but it doesn\u0092t matter"
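A possible reading of these results: stri_unescape_unicode() looks for backslash escape sequences that are literally present in the string (a backslash, a u, then hex digits). x contains the <U+0092> token instead, and x2 appears to contain the real character already, which R only displays as \u0092 when printing. As a sketch, assuming x holds the literal token, the token can first be rewritten into a backslash escape and then unescaped:

```r
x <- "but it doesn<U+0092>t matter"

# Rewrite "<U+0092>" as the six literal characters \u0092
# (backslash, "u", four hex digits).
escaped <- gsub("<U\\+([0-9A-Fa-f]{4})>", "\\\\u\\1", x)
escaped

# stringi may not be installed everywhere, so the call is guarded.
if (requireNamespace("stringi", quietly = TRUE)) {
  unescaped <- stringi::stri_unescape_unicode(escaped)
  # Code point of the 13th character, where the token used to start.
  print(utf8ToInt(substr(unescaped, 13, 13)))
}
```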
Is there some generally applicable way to deal with problems like this?
My sessionInfo is:
> sessionInfo()
R version 3.2.3 (2015-12-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
locale:
[1] LC_COLLATE=Danish_Denmark.1252 LC_CTYPE=Danish_Denmark.1252 LC_MONETARY=Danish_Denmark.1252
[4] LC_NUMERIC=C LC_TIME=Danish_Denmark.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] stringr_1.0.0
loaded via a namespace (and not attached):
[1] magrittr_1.5 tools_3.2.3 stringi_1.0-1
Running R via RStudio (0.99.893, preview) on Windows 8.1, 64-bit. Keyboard and time-units are Danish, but everything else is in English.
I've had a bit of a horrible time with this pernicious little problem, but I think/hope I've finally got somewhere.
After messing around with the read_csv option locale=locale(encoding="xyz") and trying various combinations of other solutions (the gsub solution didn't work), I tried the stringi solution...
It didn't work either. But stringi has a function, stri_enc_detect, which I ran on the problem values: stri_enc_detect(x). It suggested an encoding I hadn't tried, in this case windows-1252, which I promptly set in the read_csv options: locale=locale(encoding = "windows-1252")
Hey presto, it's displaying correctly now.
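To illustrate why the encoding matters here, a minimal base R sketch (not part of the original answer): in windows-1252, the single byte 0x92 is the right single quotation mark ’ (U+2019), whereas reading the same byte as Latin-1 gives the control character U+0092. The sketch assumes the local iconv implementation knows the name "windows-1252".

```r
# One raw byte, 0x92, as it might sit in a CSV saved as windows-1252.
byte_92 <- rawToChar(as.raw(0x92))

# Interpreting that byte as windows-1252 and converting it to UTF-8.
as_cp1252 <- iconv(byte_92, from = "windows-1252", to = "UTF-8")

# Interpreting the same byte as Latin-1 instead.
as_latin1 <- iconv(byte_92, from = "latin1", to = "UTF-8")

# Code points of the two results, printed in hexadecimal.
sprintf("U+%04X", utf8ToInt(as_cp1252))
sprintf("U+%04X", utf8ToInt(as_latin1))
```

Declaring the encoding when the file is read, as in the read_csv call above, avoids having to repair the strings afterwards.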