What is the simplest way to strip the combining character modifiers from a Unicode string in Python?
For example:
A͋͠r͍̞̫̜͌ͦ̈́͐ͅt̼̭͞h́u̡̙̞̘̙̬͖͓rͬͣ̐ͮͥͨ̀͏̣ should become Arthur
I tried the docs but I couldn't find anything that does this.
Unicode supports more than a million code points, which are written with a "U" followed by a plus sign and the number in hex; for example, the word "Hello" is written U+0048 U+0065 U+006C U+006C U+006F.
Try this:
import unicodedata
a = u"STRING GOES HERE"  # using an actual string would break Stack Overflow's code formatting.
# Keep every character whose general category does not start with "M" (Mn, Mc, Me).
u"".join(x for x in a if not unicodedata.category(x).startswith("M"))
This will remove all characters classified as marks, which is what I think you want. In general, you can get the category of a character with unicodedata.category.
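To see what the category check is keying on, here is a minimal sketch that lists the code point and general category of each character. The helper name `describe` and the sample string are assumptions made for illustration; only `unicodedata.category` comes from the answer above.

```python
import unicodedata

def describe(text):
    """Return (code point, general category) pairs for each character in text."""
    return [("U+%04X" % ord(ch), unicodedata.category(ch)) for ch in text]

# Hypothetical sample: "e" followed by COMBINING ACUTE ACCENT (U+0301).
sample = "e\u0301"

for code_point, category in describe(sample):
    print(code_point, category)
```

The base letter reports a letter category (`Ll`), while the combining accent reports `Mn`, which is why a filter on categories starting with "M" removes it.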
You could also use the \p{M} property class, which is supported by the third-party regex module:
import regex

def remove_marks(text):
    # \p{M} matches any mark character. On Python 2, write the pattern as ur"\p{M}+".
    return regex.sub(r"\p{M}+", "", text)
Example:
>>> print(s)
A͋͠r͍̞̫̜t̼̭͞h́u̡̙̞̘rͬͣ̐ͮ
>>> def remove_marks(text):
...     return regex.sub(r"\p{M}+", "", text)
...
>>> print(remove_marks(s))
Arthur
Depending on your use case, a whitelist approach might be better, e.g., to limit the input to ASCII characters only:
>>> s.encode('ascii', 'ignore').decode('ascii')
'Arthur'
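The result might depend on the Unicode normalization form of the text. For example, a precomposed é (U+00E9) is a single letter with no separate mark, so mark stripping leaves it unchanged and the ASCII filter drops it entirely; in decomposed form (NFD) it is e followed by U+0301, and both approaches yield e.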
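One way to make the result independent of the input's form is to decompose the text before filtering. The following is a sketch under that assumption, not part of the original answers; the function names `strip_marks` and `strip_marks_naive` and the sample strings are made up for illustration.

```python
import unicodedata

def strip_marks_naive(text):
    """Drop mark characters without normalizing first."""
    return "".join(ch for ch in text if not unicodedata.category(ch).startswith("M"))

def strip_marks(text):
    """Decompose to NFD, drop mark characters, then recompose to NFC."""
    decomposed = unicodedata.normalize("NFD", text)
    return unicodedata.normalize("NFC", strip_marks_naive(decomposed))

precomposed = "caf\u00e9"   # é as a single code point
decomposed = "cafe\u0301"   # e followed by COMBINING ACUTE ACCENT

print(strip_marks_naive(precomposed) == precomposed)  # the precomposed é survives
print(strip_marks(precomposed), strip_marks(decomposed))
```

Note that decomposing first also removes accents from ordinary accented letters, which may or may not be what you want if the goal is only to clean up stacked "zalgo" marks.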