Strip unicode character modifiers

What is the simplest way to strip the character modifiers from a unicode string in Python?

For example:

A͋͠r͍̞̫̜͌ͦ̈́͐ͅt̼̭͞h́u̡̙̞̘̙̬͖͓rͬͣ̐ͮͥͨ̀͏̣ should become Arthur

I tried the docs but I couldn't find anything that does this.

Try this:

import unicodedata
a = u"STRING GOES HERE"  # using an actual string would break Stack Overflow's code formatting.
u"".join(x for x in a if not unicodedata.category(x).startswith("M"))

This removes every character whose Unicode general category is a mark (Mn, Mc, or Me; all of them start with "M"), which is what I think you want. In general, you can get the category of a character with unicodedata.category.
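The same idea can be wrapped in a small reusable function. This is a minimal sketch for Python 3, where every str is already Unicode and the u"" prefix is optional; the name strip_marks and the sample string are illustrative assumptions, not part of the answer above.

```python
import unicodedata


def strip_marks(text: str) -> str:
    """Drop every character whose general category starts with "M" (a mark)."""
    return "".join(
        ch for ch in text
        if not unicodedata.category(ch).startswith("M")
    )


# Illustrative input: "Arthur" with U+0301 (combining acute accent) after
# the "A" and U+0308 (combining diaeresis) after the first "r".
decorated = "A\u0301r\u0308thur"

print(unicodedata.category("\u0301"))  # Mn
print(strip_marks(decorated))          # Arthur
```

Writing the marks as \u escapes keeps the source readable and avoids the formatting problem mentioned in the code comment above.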

You could also use the \p{M} property class, which is supported by the third-party regex module (the standard-library re module does not support \p{...}):

import regex

def remove_marks(text):
    return regex.sub(ur"\p{M}+", "", text)

Example:

>>> print s
A͋͠r͍̞̫̜t̼̭͞h́u̡̙̞̘rͬͣ̐ͮ
>>> def remove_marks(text):
...     return regex.sub(ur"\p{M}+", "", text)
...
>>> print remove_marks(s)
Arthur

Depending on your use case, a whitelist approach might be better, e.g., to limit the input to ASCII characters only:

>>> s.encode('ascii', 'ignore').decode('ascii')
u'Arthur'

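The result might depend on the Unicode normalization form of the text. A precomposed character such as é (U+00E9) is a single letter, not a base letter followed by a mark, so the mark-based approaches leave it untouched unless the text is first decomposed (for example with unicodedata.normalize("NFD", text)), while the ASCII whitelist drops it entirely.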
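To illustrate the normalization point, here is a hedged Python 3 sketch comparing the approaches on a precomposed and a decomposed spelling of the same word. The helper names strip_marks and strip_marks_normalized and the sample word are assumptions made for this example only.

```python
import unicodedata


def strip_marks(text: str) -> str:
    """Drop characters whose general category is a mark (Mn, Mc, Me)."""
    return "".join(
        ch for ch in text
        if not unicodedata.category(ch).startswith("M")
    )


def strip_marks_normalized(text: str) -> str:
    """Decompose to NFD first so precomposed letters expose their marks."""
    return strip_marks(unicodedata.normalize("NFD", text))


precomposed = "caf\u00e9"   # "é" as the single code point U+00E9 (category Ll)
decomposed = "cafe\u0301"   # "e" followed by U+0301 COMBINING ACUTE ACCENT

print(strip_marks(precomposed) == precomposed)   # True: nothing was removed
print(strip_marks(decomposed))                   # cafe
print(strip_marks_normalized(precomposed))       # cafe
# The ASCII whitelist drops the precomposed letter instead of unaccenting it.
print(precomposed.encode("ascii", "ignore").decode("ascii"))  # caf
```

Whether to normalize first depends on the goal: NFD followed by mark removal also strips ordinary accents from letters, which may or may not be wanted when the aim is only to clean up stacked combining marks.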

Tags: python, unicode, utf-8

Raphael

2 Answers: cge, jfs
