How do I convert decorated Latin Unicode characters to plain Latin in Python?

Tags:

python

unicode

Unicode specifies a number of decorated variants of Latin characters. How can I convert these Unicode characters to plain Latin characters in Python?

To be clear, I'm not asking how to strip accents from letters. I'm asking how to convert characters that carry the same linguistic meaning but have a decorated display form, such as negative (inverted), circled, or boxed letters.

For example, how do I convert

💦°🄾🅁🄸🄶🄸🄽🄰🄻°💦 c

to

💦°ORIGINAL°💦 c

(Stripping those non-language characters will be a separate task)

xaviersjs asked Aug 22 '19 23:08



1 Answer

This isn't perfect, but what you're looking for is Unicode compatibility decomposition, which maps many decorated forms (squared or circled letters, for example) back to their plain equivalents. Unicode normalization and decomposition could fill a book of their own.

For something quick and dirty, fortunately, Python has this built in for you!

>>> import unicodedata
>>> unicodedata.normalize('NFKC', '💦°🄾🅁🄸🄶🄸🄽🄰🄻°💦 c')
'💦°ORIGINAL°💦 c'
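
The "isn't perfect" caveat matters in practice: NFKC only helps where the Unicode Character Database defines a compatibility mapping. Squared letters (like the ones in the question) and circled letters have one, but the negative circled and negative squared capital letters (U+1F150 and U+1F170 onward) do not, so normalize() leaves them unchanged. Below is a minimal sketch of one way to cover them, assuming a hand-built translation table. The helper name `plain_latin` and the choice of blocks are illustrative assumptions, not part of the original answer.

```python
import unicodedata

# Assumption: these two blocks are the ones we want to fold by hand.
# Each entry is the code point of the "A" in a 26-letter A-Z run that
# has no compatibility decomposition in the Unicode Character Database.
_NEGATIVE_BLOCK_STARTS = (
    0x1F150,  # NEGATIVE CIRCLED LATIN CAPITAL LETTER A
    0x1F170,  # NEGATIVE SQUARED LATIN CAPITAL LETTER A
)

# Map every code point in those runs to the matching plain capital letter.
_FALLBACK_TABLE = {
    start + i: chr(ord("A") + i)
    for start in _NEGATIVE_BLOCK_STARTS
    for i in range(26)
}


def plain_latin(text: str) -> str:
    """Fold decorated Latin letters to plain ones (hypothetical helper).

    The hand-built table handles the negative variants first; NFKC then
    handles everything that does have a compatibility decomposition.
    """
    return unicodedata.normalize("NFKC", text.translate(_FALLBACK_TABLE))
```

With this sketch, `plain_latin("\U0001F170\U0001F171")` (negative squared A and B) returns `'AB'`, while characters with no letter meaning, such as the emoji and the degree sign in the question, pass through untouched. Other decorated alphabets that lack a decomposition would need their own table entries.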
Alyssa Haroldsen answered Sep 23 '22 08:09