
Trouble with strings with <U+0092> Unicode characters

I have a very large dataset (70k rows, 2600 columns, CSV format) that I created by web scraping. Unfortunately, at some point during the pre-processing, processing etc., some problematic characters became encoded in an odd way, and I have problems dealing with them.

I have strings like the following:

x = "but it doesn<U+0092>t matter"

Looking up the code, we can see that the string holds the character U+0092, which should actually be an apostrophe (') (the data are user-generated, so they may contain all kinds of odd characters). From looking up that character, it seems that others also have problems with it (1, 2, 3). It's labelled a control character; I'm not sure what that is, but perhaps that's why it's so hard to deal with.
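As a hedged illustration of why this code point tends to appear: in the Windows-1252 code page the single byte 0x92 is the right single quotation mark (U+2019), whereas U+0092 itself is a C1 control character. A minimal sketch, assuming the scraped text was originally Windows-1252 (an assumption, not something the data above proves):

```r
# Assumption: the original bytes were Windows-1252, where 0x92 is a curly apostrophe.
byte_92 <- rawToChar(as.raw(0x92))

# Decode that byte as Windows-1252 and re-encode it as UTF-8.
decoded <- iconv(byte_92, from = "windows-1252", to = "UTF-8")

# Inspect the resulting code point; 8217 is U+2019 in decimal.
code_point <- utf8ToInt(decoded)
code_point
```

If the same byte is instead read as ISO-8859-1, it maps straight to U+0092, which would be consistent with what the string shows.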

Most of the other questions about Unicode in R concern Unicode written in a format like \u0092.

Just use Encoding()

Let's try:

#> x = "but it doesn<U+0092>t matter"
#> Encoding(x)
#[1] "unknown"
#> Encoding(x) = "UTF-8"
#> Encoding(x)
#[1] "unknown"
#> x
#[1] "but it doesn<U+0092>t matter"

So this does not seem to do anything.
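One possible explanation, offered as an assumption: if the string literally contains the eight ASCII characters <U+0092> rather than a single non-ASCII character, there is nothing for Encoding() to mark, because R never attaches a declared encoding to pure-ASCII strings. A small check of which case applies (x_literal is an illustrative name):

```r
# Assumption: the code is spelled out as plain ASCII text inside the string.
x_literal <- "but it doesn<U+0092>t matter"

# Try to declare the encoding; for ASCII-only strings the mark is not stored.
Encoding(x_literal) <- "UTF-8"
enc_after <- Encoding(x_literal)

# TRUE means every byte is below 128, i.e. the string is pure ASCII.
is_pure_ascii <- all(as.integer(charToRaw(x_literal)) < 128L)

# Number of characters in the string.
n_chars <- nchar(x_literal)
```

Here n_chars is 28; a string holding one real character in place of the spelled-out code would have 21 characters, so nchar() is a quick way to tell the two cases apart.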

Use the hack functions from these previous questions

There are a few prior questions that concern this Unicode format and try to convert them:

  • Display unicode in R
  • gsub in R with unicode replacement give different results under Windows compared with Unix?

Oddly, the example they give works, but mine doesn't.

#> test.string <- "This is a <U+03B1> <U+03B2> <U+03B2> <U+03B3> test <U+03B4> string."
#> Encoding(test.string)
#[1] "unknown"
#> to_true_unicode(test.string)
#[1] "This is a α β β γ test δ string."

But:

#> x2 = to_true_unicode(x)
#> x2
#[1] "but it doesn\u0092t matter"
#> cat(x2)
#but it doesnt matter
#> Encoding(x2)
#[1] "UTF-8"

So, it managed to convert to the \u format from the <U+....> format, and using cat() prints the character without that symbol (or a bugged symbol on SO).
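For readers without the linked questions at hand, a conversion of this kind can be sketched in base R. The helper below is hypothetical (unescape_angle_codes is not the to_true_unicode() from those questions) and assumes the codes appear as literal <U+XXXX> text with 4 to 6 hex digits:

```r
# Hypothetical helper: replace every literal "<U+XXXX>" token with the
# character that the hex code names.
unescape_angle_codes <- function(s) {
  m <- gregexpr("<U\\+[0-9A-Fa-f]{4,6}>", s)
  regmatches(s, m) <- lapply(regmatches(s, m), function(tokens) {
    hex <- gsub("^<U\\+|>$", "", tokens)  # keep only the hex digits
    vapply(strtoi(hex, base = 16L), intToUtf8, character(1))
  })
  s
}

greek <- unescape_angle_codes("This is a <U+03B1> test")
ctrl  <- unescape_angle_codes("<U+0092>")
```

Note that this turns <U+0092> into the control character U+0092 itself, not into an apostrophe, which matches the behaviour described above.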

Just search and replace them manually

I only have a limited number of these problems, so I could perhaps just use search-replace to solve it. However:

#> #base-r
#> gsub(x = x, pattern = "<U+0092>", replacement = "'")
#[1] "but it doesn<U+0092>t matter"
#> #stringr/stringi
#> library(stringr)
#> str_replace(x, pattern = "<U+0092>", "'")
#[1] "but it doesn<U+0092>t matter"

So replacement does not seem to work, but it does work on the \u versions:

#> #base-r
#> gsub(x = x2, pattern = "\u0092", replacement = "'")
#[1] "but it doesn't matter"
#> #stringr/stringi
#> library(stringr)
#> str_replace(x2, pattern = "\u0092", "'")
#[1] "but it doesn't matter"

So, this suggests a working method: 1) convert the <U+....> format to the \u format, then 2) use search-replace.
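A possible reason the first replacement attempt finds nothing, again assuming the string literally contains the ASCII text <U+0092>: in a regular expression the plus sign is a quantifier, so the pattern "<U+0092>" means "<", then one or more "U", then "0092>". A minimal sketch of literal matching instead (x_literal is an illustrative name):

```r
# Assumption: the string really contains the eight ASCII characters "<U+0092>".
x_literal <- "but it doesn<U+0092>t matter"

# As a regular expression, "U+" means "one or more U", so nothing matches.
as_regex <- gsub("<U+0092>", "'", x_literal)

# Matching the pattern literally, or escaping the plus sign, does match.
as_fixed   <- gsub("<U+0092>", "'", x_literal, fixed = TRUE)
as_escaped <- gsub("<U\\+0092>", "'", x_literal)
```

With stringr, wrapping the pattern in fixed() plays the same role as fixed = TRUE does in gsub().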

Unescape with stringi::stri_unescape_unicode()

Does not seem to work with either version:

#> stringi::stri_unescape_unicode(x)
#[1] "but it doesn<U+0092>t matter"
#> stringi::stri_unescape_unicode(x2)
#[1] "but it doesn\u0092t matter"

Is there some generally applicable way to deal with problems like this?

My setup

My sessionInfo is:

> sessionInfo()
R version 3.2.3 (2015-12-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

locale:
[1] LC_COLLATE=Danish_Denmark.1252  LC_CTYPE=Danish_Denmark.1252    LC_MONETARY=Danish_Denmark.1252
[4] LC_NUMERIC=C                    LC_TIME=Danish_Denmark.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] stringr_1.0.0

loaded via a namespace (and not attached):
[1] magrittr_1.5  tools_3.2.3   stringi_1.0-1

Running R via RStudio (0.99.893, preview) on Windows 8.1, 64-bit. Keyboard and time-units are Danish, but everything else is in English.

CoderGuy123 asked Mar 20 '16 00:03


1 Answer

I've had a bit of a horrible time with this pernicious little problem, but I think/hope I've finally got somewhere.

After messing around with the read_csv options locale=locale(encoding="xyz") and trying various combinations of other solutions (the gsub solution didn't work), I tried the stringi solution...

It didn't work, either. But stringi has a function, stri_enc_detect, which I ran on the problem values: stri_enc_detect(x). It gave me an encoding I hadn't tried, in this case windows-1252, which I promptly set in the read_csv options: locale=locale(encoding = "windows-1252")

Hey presto, it's displaying correctly now.
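The fix described above can be sketched end to end. This is a minimal example under stated assumptions: the readr package is installed, and the temporary file and the column name text are made up for illustration rather than taken from the real dataset:

```r
library(readr)

# Build a tiny CSV whose apostrophe is stored as the single Windows-1252 byte 0x92.
csv_path <- tempfile(fileext = ".csv")
writeBin(
  c(charToRaw("text\nbut it doesn"), as.raw(0x92), charToRaw("t matter\n")),
  csv_path
)

# Tell readr which encoding the file uses; readr converts the text to UTF-8.
df <- read_csv(
  csv_path,
  locale = locale(encoding = "windows-1252"),
  col_types = cols(text = col_character())
)

# Code point of the 13th character, the one right after "doesn".
apostrophe_code <- utf8ToInt(substr(df$text[1], 13, 13))
```

Declaring the encoding at read time avoids having to repair individual strings afterwards, which is likely why this worked where the search-and-replace attempts did not.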

gladys_c_hugh answered Nov 09 '22 03:11