
Trouble with strings with <U+0092> Unicode characters

Tags: r, encoding, unicode

I have a very large dataset (70k rows, 2600 columns, CSV format) that I created by web scraping. Unfortunately, at some point during pre-processing and processing, some problematic characters were encoded in an odd way, and I am having trouble dealing with them.

I have strings like the following:

x = "but it doesn<U+0092>t matter"

Looking up the code, we can see that it should be the character ’, which really should be ' (the data are user-generated, so they may contain all kinds of odd characters). From searching for that character, it seems that others also have problems with it (1, 2, 3). It's labelled a control character; I'm not sure what that is, but perhaps that's why it's so hard to deal with.

Most of the other questions about Unicode in R concern characters written in the \u0092 format.

Just use `Encoding()`

Let's try:

#> x = "but it doesn<U+0092>t matter"
#> Encoding(x)
#[1] "unknown"
#> Encoding(x) = "UTF-8"
#> Encoding(x)
#[1] "unknown"
#> x
#[1] "but it doesn<U+0092>t matter"

So this does not seem to do anything.
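A likely reason, assuming `x` really contains the eight literal ASCII characters `<U+0092>` rather than the character U+0092 itself: R only records a declared encoding for strings that contain non-ASCII bytes, so `Encoding()` reports "unknown" for pure-ASCII strings no matter what is assigned. A small sketch of the difference (the variable `y` is only an illustration, not part of my data):

```r
x <- "but it doesn<U+0092>t matter"

# Every byte is below 128, so the string is plain ASCII.
x_is_ascii <- all(as.integer(charToRaw(x)) < 128L)
x_is_ascii

# Marking an ASCII string has no effect; it stays "unknown".
Encoding(x) <- "UTF-8"
enc_x <- Encoding(x)
enc_x

# A string that holds the real character does carry an encoding mark.
y <- "but it doesn\u0092t matter"
enc_y <- Encoding(y)
enc_y

# The marker is 8 characters long, the real character is 1.
nchar(x) - nchar(y)
```

If this is what is happening, the string needs to be converted rather than re-labelled.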

Use the hack functions from these previous questions

There are a few prior questions that concern this Unicode format and try to convert it:

Display unicode in R
gsub in R with unicode replacement give different results under Windows compared with Unix?

Oddly, the example they give works, but mine doesn't.

#> test.string <- "This is a <U+03B1> <U+03B2> <U+03B2> <U+03B3> test <U+03B4> string."
#> Encoding(test.string)
#[1] "unknown"
#> to_true_unicode(test.string)
#[1] "This is a α β β γ test δ string."

But:

#> x2 = to_true_unicode(x)
#> x2
#[1] "but it doesn\u0092t matter"
#> cat(x2)
#but it doesnt matter
#> Encoding(x2)
#[1] "UTF-8"

So it managed to convert the <U+....> format to the \u format, and cat() prints the string without a visible symbol in that position (or with a bugged symbol on SO).
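`to_true_unicode()` comes from the linked questions and is not reproduced here. As a rough idea of what such a conversion involves, here is a minimal base-R sketch. The name `u_marker_to_char()` is hypothetical (it is not the function from those answers), and it assumes the markers are literal text of the form `<U+`, 4 to 6 hex digits, then `>`:

```r
# Hypothetical helper: replace each literal "<U+XXXX>" marker in `s`
# with the Unicode character that the hex code names.
u_marker_to_char <- function(s) {
  # Collect the distinct markers that occur anywhere in the input.
  markers <- unique(unlist(regmatches(s, gregexpr("<U\\+[0-9A-Fa-f]{4,6}>", s))))
  for (marker in markers) {
    # Strip the "<U+" prefix and ">" suffix, then parse the hex digits.
    code_point <- strtoi(gsub("^<U\\+|>$", "", marker), base = 16L)
    # fixed = TRUE so the "+" in the marker is matched literally.
    s <- gsub(marker, intToUtf8(code_point), s, fixed = TRUE)
  }
  s
}

u_marker_to_char("This is a <U+03B1> <U+03B2> test.")
x2_sketch <- u_marker_to_char("but it doesn<U+0092>t matter")
Encoding(x2_sketch)
```

For U+0092 the result still contains a control character, so it would print as `\u0092` at the console, as in the output above.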

Just search and replace them manually

I only have a limited number of these problem characters, so I could perhaps just use search-replace to fix them. However:

#> #base-r
#> gsub(x = x, pattern = "<U+0092>", replacement = "'")
#[1] "but it doesn<U+0092>t matter"
#> #stringr/stringi
#> library(stringr)
#> str_replace(x, pattern = "<U+0092>", "'")
#[1] "but it doesn<U+0092>t matter"

So replacement does not seem to work, but it does work on the \u versions:

#> #base-r
#> gsub(x = x2, pattern = "\u0092", replacement = "'")
#[1] "but it doesn't matter"
#> #stringr/stringi
#> library(stringr)
#> str_replace(x2, pattern = "\u0092", "'")
#[1] "but it doesn't matter"

So, this suggests a working method: 1) convert the <U+> format to the \u format, 2) use search-replace.
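A possible shortcut, assuming `x` holds the literal eight characters `<U+0092>`: in a regular expression `+` is a quantifier, so the pattern `"<U+0092>"` means `<`, one or more `U`, then `0092>`, and it never matches the literal plus sign. Matching the marker as a fixed string, or escaping the plus, may be all that is needed:

```r
x <- "but it doesn<U+0092>t matter"

# Unescaped "+" is a quantifier, so this pattern does not match the marker.
grepl("<U+0092>", x)

# Treat the pattern as a literal string...
fixed_result <- gsub("<U+0092>", "'", x, fixed = TRUE)

# ...or escape the plus sign in the regular expression.
escaped_result <- gsub("<U\\+0092>", "'", x)

fixed_result
escaped_result
```

With stringr, wrapping the pattern in `fixed()` should have the same effect; note that `str_replace()` only replaces the first match, while `str_replace_all()` replaces every one.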

Unescape with `stringi::stri_unescape_unicode()`

Does not seem to work with either version:

#> stringi::stri_unescape_unicode(x)
#[1] "but it doesn<U+0092>t matter"
#> stringi::stri_unescape_unicode(x2)
#[1] "but it doesn\u0092t matter"

Is there some generally applicable way to deal with problems like this?

My setup

My sessionInfo is:

> sessionInfo()
R version 3.2.3 (2015-12-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

locale:
[1] LC_COLLATE=Danish_Denmark.1252  LC_CTYPE=Danish_Denmark.1252    LC_MONETARY=Danish_Denmark.1252
[4] LC_NUMERIC=C                    LC_TIME=Danish_Denmark.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] stringr_1.0.0

loaded via a namespace (and not attached):
[1] magrittr_1.5  tools_3.2.3   stringi_1.0-1

Running R via RStudio (0.99.893, preview) on Windows 8.1, 64-bit. Keyboard and time-units are Danish, but everything else is in English.

asked Mar 20 '16 00:03

CoderGuy123

1 Answer

I've had a bit of a horrible time with this pernicious little problem, but I think/hope I've finally got somewhere.

After messing around with the read_csv option locale=locale(encoding="xyz") and trying various combinations of other solutions (the gsub solution didn't work), I tried the stringi solution...

It didn't work, either. But stringi has a function stri_enc_detect, which I ran on the problem values: stri_enc_detect(x). It gave me an encoding I hadn't tried, in this case windows-1252, which I promptly set in the read_csv options: locale=locale(encoding = "windows-1252")

Hey presto it's displaying correctly now.
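A plausible explanation for why this helps, assuming the scraped file stores the apostrophe as the single byte 0x92: in Windows-1252 that byte is the right single quotation mark (U+2019), whereas in Latin-1 the same byte maps to the control character U+0092. The base-R sketch below decodes the byte both ways; the byte value is an assumption about the file, not something shown in the question.

```r
# The byte 0x92 as it might appear in the raw file (assumed).
s <- rawToChar(as.raw(0x92))

# Decode the same byte under two different guesses about the file encoding.
as_cp1252 <- iconv(s, from = "windows-1252", to = "UTF-8")
as_latin1 <- iconv(s, from = "latin1", to = "UTF-8")

# Code points of the decoded characters, printed in hex.
sprintf("U+%04X", utf8ToInt(as_cp1252))
sprintf("U+%04X", utf8ToInt(as_latin1))
```

If that is the situation, declaring windows-1252 when reading the file decodes the byte as ’ from the start, so no `<U+0092>` marker is ever produced.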

answered Nov 09 '22 03:11

gladys_c_hugh
