Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Unicode normalization (form C) in R : convert all characters with accents into their one-unicode-character form?

In Unicode, letters with accents can be represented in two ways: the accentuated letter itself, and the combination of the bare letter plus the accent. For example, é (+U00E9) and e´ (+U0065 +U0301) are usually displayed in the same way.

R renders the following (version 3.0.2, Mac OS 10.7.5):

> "\u00e9"
[1] "é"
> "\u0065\u0301"
[1] "é"

However, of course:

> "\u00e9" == "\u0065\u0301"
[1] FALSE

Is there a function in R which converts two-unicode-character-letters into their one-character form? In particular, here it would collapse "\u0065\u0301" into "\u00e9".

That would be extremely handy to process large quantities of strings. Plus, the one-character forms can easily be converted to other encodings via iconv -- at least for the usual Latin1 characters -- and is better handled by plot.

Thanks a lot in advance.

like image 726
AlxH Avatar asked Dec 08 '13 20:12

AlxH


People also ask

What is Unicode normalization form?

The standard also defines a text normalization procedure, called Unicode normalization, that replaces equivalent sequences of characters so that any two texts that are equivalent will be reduced to the same sequence of code points, called the normalization form or normal form of the original text.

What is Unicode Normalization in NLP?

Unicode normalization is our solution to both canonical and compatibility equivalence issues. In normalization, there are two directions and two types of conversions we can make. The two types we have already covered, canonical and compatibility.

What is character normalization?

Character normalization is a process that can improve recall. Improving recall by character normalization means that more documents are retrieved even if the documents do not exactly match the query.

What is the best way to remove accents normalize in a Python Unicode string?

Unidecode is the correct answer for this. It transliterates any unicode string into the closest possible representation in ascii text.


1 Answers

Ok, it appears that a package has been developed to enhance and simplify the string manipulation toolbox in R (finally!). It is called stringi and looks very promising. Its documentation is very well written, and in particular I find the pages about encodings and locales much more enlightening than some of the standard R documentation on the subject.

It has Unicode normalization functions, as I was looking for (here form C):

> stri_trans_nfc('\u00e9') == stri_trans_nfc('\u0065\u0301')
[1] TRUE

It also contains a smart comparison function which integrates these normalization questions and lessens the pain of having to think about them:

> stri_compare('\u00e9', '\u0065\u0301')
[1] 0
# i.e. equal ;
# otherwise it returns 1 or -1, i.e. greater or lesser, in the alphabetic order.

Thanks to the developers, Marek Gągolewski and Bartek Tartanus, and to Kurt Hornik for the info!

like image 126
AlxH Avatar answered Sep 16 '22 23:09

AlxH