I retrieve text-based UTF-8 data from a foreign source which contains special characters such as u"ıöüç", and I want to normalize them to their ASCII equivalents, e.g. "ıöüç" -> "iouc". What would be the best way to achieve this?
The Unicode standard defines various normalization forms of a Unicode string, based on the definitions of canonical equivalence and compatibility equivalence. In Unicode, the same character can often be expressed in several ways.
Character normalization is a process that can improve recall: with normalization, more documents are retrieved even when they do not exactly match the query.
The standard also defines a text normalization procedure, called Unicode normalization, that replaces equivalent sequences of characters so that any two texts which are equivalent are reduced to the same sequence of code points, called the normalization form or normal form of the original text.
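To see this concretely, here is a minimal sketch using Python's standard unicodedata module: the accented letter é can be written either as the single code point U+00E9 or as e followed by U+0301 COMBINING ACUTE ACCENT, and normalization reduces both spellings to the same sequence.
>>> import unicodedata
>>> composed = '\u00e9'      # 'é' as a single code point
>>> decomposed = 'e\u0301'   # 'e' + COMBINING ACUTE ACCENT
>>> composed == decomposed   # different code point sequences
False
>>> unicodedata.normalize('NFC', composed) == unicodedata.normalize('NFC', decomposed)
True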
I recommend using Unidecode module:
>>> from unidecode import unidecode
>>> unidecode(u'ıöüç')
'iouc'
Note how you feed it a Unicode string and it returns a string containing only ASCII characters (a byte string on Python 2, an ordinary str on Python 3). The output is guaranteed to be ASCII.
It all depends on how far you want to go in transliterating the result. If you want to convert everything all the way to ASCII (αβγ to abg), then unidecode is the way to go.
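For example (assuming the package is installed, e.g. via pip install Unidecode):
>>> from unidecode import unidecode
>>> unidecode(u'αβγ')
'abg'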
If you just want to remove accents from accented letters, then you could try decomposing your string using normalization form NFKD (this converts the accented letter á to a plain letter a followed by U+0301 COMBINING ACUTE ACCENT) and then discarding the accents (which belong to the Unicode character category Mn, "Mark, nonspacing").
import unicodedata

def remove_nonspacing_marks(s):
    "Decompose the unicode string s and remove non-spacing marks."
    return ''.join(c for c in unicodedata.normalize('NFKD', s)
                   if unicodedata.category(c) != 'Mn')
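For example, applied to the string from the question:
>>> remove_nonspacing_marks(u'ıöüç')
'ıouc'
Note that ı (LATIN SMALL LETTER DOTLESS I) is left unchanged: it has no decomposition, since it is not an accented letter but a distinct character. That is the kind of case where unidecode goes further than plain decomposition.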