
Python - replace unicode emojis with ASCII characters

I have an issue with one of my current weekend projects. I am writing a Python script that fetches data from several sources and then prints everything on an ESC/POS printer. As you might imagine, POS printers don't exactly like emojis...

So text like this:

可爱!!!!!!!!😍😍😍😍😍😍😍😝

gives me this character string:

'\u53ef\u7231!!!!!!!!\U0001f60d\U0001f60d\U0001f60d\U0001f60d\U0001f60d\U0001f60d\U0001f60d\U0001f61d'

The result that comes out of the printer is, of course, quite different from what I would like. So I need to replace these non-ASCII characters with something else. I don't really care about the first characters, but I do care about the emojis. Using something like unidecode(str(text)) will at least strip them out, but I want to convert them into something more useful: either classic smileys like [:-D] or names like [SMILING FACE WITH HEART-SHAPED EYES].

My problem is... how would one go about doing this? Manually creating a lookup table for the most common emojis seems a bit tedious, so I am wondering whether there is something else I can do.

user3082900 asked May 05 '17 05:05


2 Answers

With the tip about unicodedata.name and some further research I managed to put this thing together:

import unicodedata
from unidecode import unidecode

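def deEmojify(inputString):
    returnString = ""

    for character in inputString:
        try:
            # Plain ASCII characters pass through unchanged.
            character.encode("ascii")
            returnString += character
        except UnicodeEncodeError:
            # First choice: unidecode's closest ASCII transliteration.
            replaced = unidecode(character)
            if replaced != '':
                returnString += replaced
            else:
                try:
                    # Second choice: the official Unicode name in brackets.
                    returnString += "[" + unicodedata.name(character) + "]"
                except ValueError:
                    # Last resort: the character has no Unicode name.
                    returnString += "[x]"

    return returnString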

Basically, it first tries to find the most appropriate ASCII representation. If that fails, it tries the Unicode name, and if even that fails, it falls back to a simple marker.

For example, taking this string:

abcdšeđfčgžhÅiØjÆk 可爱!!!!!!!!😍😍😍😍😍😍😍😝

And running the function:

string = u'abcdšeđfčgžhÅiØjÆk\uf8ff \u53ef\u7231!!!!!!!!\U0001f60d\U0001f60d\U0001f60d\U0001f60d\U0001f60d\U0001f60d\U0001f60d\U0001f61d'
print(deEmojify(string))

Will produce the following result:

abcdsedfcgzhAiOjAEk[x] Ke Ai !!!!!!!![SMILING FACE WITH HEART-SHAPED EYES][SMILING FACE WITH HEART-SHAPED EYES][SMILING FACE WITH HEART-SHAPED EYES][SMILING FACE WITH HEART-SHAPED EYES][SMILING FACE WITH HEART-SHAPED EYES][SMILING FACE WITH HEART-SHAPED EYES][SMILING FACE WITH HEART-SHAPED EYES][FACE WITH STUCK-OUT TONGUE AND TIGHTLY-CLOSED EYES]
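
If you would rather get classic smileys for a handful of emojis, one option is to check a small lookup table before falling back to the Unicode name. The following is only a rough sketch: the EMOTICONS table and the emoji_to_ascii name are made up for illustration, the smileys in it are my own picks, and the unidecode step is left out so the snippet runs with the standard library alone (in the real script it would sit between the table lookup and the name fallback).

```python
import unicodedata

# Hypothetical hand-picked table; extend it with whichever emojis matter to you.
EMOTICONS = {
    "\U0001f600": ":-D",   # GRINNING FACE
    "\U0001f609": ";-)",   # WINKING FACE
    "\U0001f61d": "X-P",   # FACE WITH STUCK-OUT TONGUE AND TIGHTLY-CLOSED EYES
    "\U0001f622": ":'-(",  # CRYING FACE
}

def emoji_to_ascii(text):
    parts = []
    for ch in text:
        if ord(ch) < 128:
            # ASCII passes through unchanged.
            parts.append(ch)
        elif ch in EMOTICONS:
            # Known emoji: use the hand-picked smiley.
            parts.append("[" + EMOTICONS[ch] + "]")
        else:
            # Everything else: Unicode name, or "x" if the character has none.
            parts.append("[" + unicodedata.name(ch, "x") + "]")
    return "".join(parts)

print(emoji_to_ascii("Hi \U0001f600\U0001f60d"))
# prints: Hi [:-D][SMILING FACE WITH HEART-SHAPED EYES]
```

Only the emojis you actually care about need an entry; anything missing from the table still gets a readable name, so the table can stay short.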

user3082900 answered Oct 31 '22 23:10


Try this:

import unicodedata
print( unicodedata.name(u'\U0001f60d'))

The result is:

SMILING FACE WITH HEART-SHAPED EYES
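
One thing to watch for: unicodedata.name raises ValueError for characters that have no name (private-use code points, for instance). It accepts an optional second argument that is returned instead in that case. A small sketch, where describe and fallback are just illustrative names of mine:

```python
import unicodedata

def describe(ch, fallback="x"):
    # unicodedata.name(chr, default) returns `default` instead of raising
    # ValueError when the character has no Unicode name.
    return "[" + unicodedata.name(ch, fallback) + "]"

print(describe("\U0001f60d"))  # prints: [SMILING FACE WITH HEART-SHAPED EYES]
print(describe("\uf8ff"))      # private-use character, prints: [x]
```
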
BoarGules answered Oct 31 '22 23:10