
Strip unicode character modifiers

What is the simplest way to strip the character modifiers (combining marks) from a Unicode string in Python?

For example:

A͋͠r͍̞̫̜͌ͦ̈́͐ͅt̼̭͞h́u̡̙̞̘̙̬͖͓rͬͣ̐ͮͥͨ̀͏̣ should become Arthur

I looked through the docs but couldn't find anything that does this.

Raphael asked Jun 13 '13 22:06



2 Answers

Try this:

import unicodedata

a = u"STRING GOES HERE"  # using an actual string would break stackoverflow's code formatting.
# Keep only the characters whose Unicode general category is not a mark (Mn, Mc, Me).
u"".join(x for x in a if not unicodedata.category(x).startswith("M"))

This removes every character that Unicode classifies as a mark (general categories Mn, Mc, and Me), which is what I think you want. In general, you can get the category of a character with unicodedata.category.
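As a minimal, self-contained sketch of the same idea (Python 3), the filter can be wrapped in a function and tried on a short string. The names strip_marks and sample are hypothetical and not part of the answer above, and the sample uses escape sequences rather than the string from the question:

```python
import unicodedata

def strip_marks(text):
    """Drop every code point whose general category starts with 'M' (Mn, Mc, Me)."""
    return "".join(ch for ch in text if not unicodedata.category(ch).startswith("M"))

# 'A' + COMBINING ACUTE ACCENT (U+0301), 'r' + COMBINING TILDE (U+0303), then 'thur'
sample = "A\u0301r\u0303thur"

# Inspect the categories of the first two code points: a letter, then a mark.
print([unicodedata.category(ch) for ch in sample[:2]])  # ['Lu', 'Mn']
print(strip_marks(sample))  # Arthur
```

Base letters keep their own category (Lu, Ll, ...), so only the combining code points are filtered out.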

cge answered Oct 16 '22 10:10


You could also use r'\p{M}', which is supported by the third-party regex module:

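import regex  # third-party module (pip install regex), not the stdlib re

def remove_marks(text):
    # \p{M} matches any code point in a Unicode mark category.
    return regex.sub(r"\p{M}+", "", text)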

Example:

>>> print(s)
A͋͠r͍̞̫̜t̼̭͞h́u̡̙̞̘rͬͣ̐ͮ
>>> def remove_marks(text):
...     return regex.sub(r"\p{M}+", "", text)
...
>>> print(remove_marks(s))
Arthur

Depending on your use case, a whitelist approach might be better, e.g., to limit the input to ASCII characters only:

>>> s.encode('ascii', 'ignore').decode('ascii')
'Arthur'

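The result might depend on the Unicode normalization form used in the text. A precomposed character such as é (U+00E9) contains no separate mark, so mark removal leaves it intact and the ASCII filter drops it entirely, unless the text is first decomposed, e.g. with unicodedata.normalize('NFD', text).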
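A small sketch of how normalization changes the outcome (Python 3). The helper name strip_marks_nfd and the sample strings are hypothetical, chosen only for illustration, and decomposing to NFD first is one possible choice rather than something the answer prescribes:

```python
import unicodedata

def strip_marks_nfd(text):
    """Decompose to NFD first, then drop all mark characters."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.category(ch).startswith("M"))

precomposed = "caf\u00e9"   # 'é' stored as a single code point
decomposed = "cafe\u0301"   # 'e' followed by COMBINING ACUTE ACCENT

# Without normalization the precomposed 'é' has no separate mark to remove.
unchanged = "".join(ch for ch in precomposed if not unicodedata.category(ch).startswith("M"))
print(unchanged == precomposed)  # True

# After NFD decomposition both spellings reduce to the same base letters.
print(strip_marks_nfd(precomposed))  # cafe
print(strip_marks_nfd(decomposed))   # cafe

# The ASCII whitelist drops the precomposed 'é' altogether.
print(precomposed.encode("ascii", "ignore").decode("ascii"))  # caf
```

Whether to normalize first depends on whether accented letters should lose only their accent or be left alone.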

jfs answered Oct 16 '22 12:10