In Unicode, letters with accents can be represented in two ways: as the precomposed accented letter itself, or as the combination of the bare letter plus a combining accent. For example, é (U+00E9) and e followed by a combining acute accent (U+0065 U+0301) are usually displayed the same way.
R renders the following (version 3.0.2, Mac OS 10.7.5):
> "\u00e9"
[1] "é"
> "\u0065\u0301"
[1] "é"
However, of course:
> "\u00e9" == "\u0065\u0301"
[1] FALSE
Is there a function in R that converts such two-code-point letters into their one-character (precomposed) form? In particular, here it would collapse "\u0065\u0301" into "\u00e9".
That would be extremely handy for processing large quantities of strings. Plus, the one-character forms can easily be converted to other encodings via iconv -- at least for the usual Latin-1 characters -- and are better handled by plot.
Thanks a lot in advance.
Ok, it appears that a package has been developed to enhance and simplify the string manipulation toolbox in R (finally!). It is called stringi and looks very promising. Its documentation is very well written, and in particular I find the pages about encodings and locales much more enlightening than some of the standard R documentation on the subject.
It has Unicode normalization functions, as I was looking for (here form C):
> stri_trans_nfc('\u00e9') == stri_trans_nfc('\u0065\u0301')
[1] TRUE
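As an aside, the same normalization forms are defined by Unicode itself, not just by stringi, so the behaviour can be checked in other languages too. For instance, Python's standard unicodedata module (shown here only for comparison) performs the same canonical composition (NFC) and decomposition (NFD):

```python
import unicodedata

composed = "\u00e9"            # é as a single precomposed code point
decomposed = "\u0065\u0301"    # e + combining acute accent

# NFC composes the sequence into the precomposed form;
# NFD decomposes the precomposed form back into base + accent.
print(unicodedata.normalize("NFC", decomposed) == composed)   # True
print(unicodedata.normalize("NFD", composed) == decomposed)   # True
```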
It also contains a smart comparison function which integrates these normalization questions and lessens the pain of having to think about them:
> stri_compare('\u00e9', '\u0065\u0301')
[1] 0
# i.e. equal;
# otherwise it returns 1 or -1, i.e. greater or lesser in alphabetical order.
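The idea behind such a comparison function is simply "normalize first, then compare". A minimal sketch of that idea, using Python's unicodedata for illustration (nfc_compare is a hypothetical helper; note that real stri_compare uses locale-aware ICU collation, whereas this sketch falls back to plain code-point order for non-equivalent strings):

```python
import unicodedata

def nfc_compare(a, b):
    # Hypothetical helper: normalize both strings to NFC so that
    # canonically equivalent strings compare equal (returns 0);
    # otherwise return -1 or 1 by plain code-point order.
    a = unicodedata.normalize("NFC", a)
    b = unicodedata.normalize("NFC", b)
    return (a > b) - (a < b)

print(nfc_compare("\u00e9", "\u0065\u0301"))  # 0: canonically equivalent
```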
Thanks to the developers, Marek Gągolewski and Bartek Tartanus, and to Kurt Hornik for the info!