
Python3 src encodings of Emojis

I'd like to print emojis from python(3) src

I'm working on a project that analyzes Facebook message histories. In the raw htm data file I downloaded, a lot of emojis are displayed as boxes with question marks, as happens when a character can't be displayed. If I copy and paste these symbols into a terminal as strings, I get values such as \U000fe328. This is also the output I get when I run the htm files through BeautifulSoup and output the data.

I Googled this string (and others), and one of the only sites that consistently comes up is iemoji.com. For the string above, it returns this page, which lists the string as a "Python Src". I want to be able to print these strings as their corresponding emojis (after all, they were originally emojis when they were sent). After looking around I found a mapping of src encodings at this page, which maps strings like the one above to emoji string names. I then found this list of emoji string names to Unicode, which for the most part seems to map the emoji names to Unicode code points. If I print those values, I get good output, for example:

>>> print(u'\U0001F624')
😤

Is there a way to map these "Python Src" encodings to their Unicode values? Chaining both mappings would work, except that the original src mapping is missing around 50% of the Unicode values found in the Unicode list. And if I do end up having to do that, is there a good way to find the Python Src value of a given emoji? From my testing, emoji as strings equal their Unicode escapes, such as '😤' == u'\U0001F624', but I can't find any relation between these and \U000fe328.

asked Aug 05 '16 02:08

User


1 Answer

This has nothing to do with Python. An escape like \U000fe328 just contains the hexadecimal representation of a code point, so this one is U+0FE328 (which is a private use character).

These days a lot of emoji are assigned to code points, e.g. 😤 is U+01F624 (FACE WITH LOOK OF TRIUMPH).
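To see the difference between the two escapes for yourself, something like the following should work. It is only a small sketch using the standard `unicodedata` module; the helper name `describe` is made up for illustration.

```python
import unicodedata

legacy = "\U000fe328"    # the escape from the question
standard = "\U0001F624"  # the standard emoji from the question


def describe(ch):
    """Return (code point label, general category, character name or None)."""
    return (
        f"U+{ord(ch):04X}",
        unicodedata.category(ch),
        unicodedata.name(ch, None),  # private use characters have no name
    )


print(describe(legacy))    # ('U+FE328', 'Co', None)
print(describe(standard))  # ('U+1F624', 'So', 'FACE WITH LOOK OF TRIUMPH')
```

The category `Co` means "private use", which is why no font knows what to draw for it, while `So` ("other symbol") is what assigned emoji typically report.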

Before these were assigned, various programs used various code points in the private use ranges to represent emoji. Facebook apparently used the private use character U+0FE328. The mapping from these code points to the standard code points is arbitrary. Some of them may not have a standard equivalent at all.

So what you have to look for is a table which tells you which of these old assignments correspond to which standard code point.
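Once you have such a table, applying it in Python could look roughly like this. This is a sketch, not a complete solution: the names `LEGACY_TO_STANDARD`, `normalize_emoji` and `unmapped_private_use` are hypothetical, and the single table entry is an illustrative assumption that you should check against whatever mapping you end up using.

```python
import unicodedata

# Hypothetical table: legacy private use code point -> standard code point.
# The one entry here is an assumption for illustration, not a verified mapping.
LEGACY_TO_STANDARD = {
    0xFE328: 0x1F624,
}


def normalize_emoji(text, table=LEGACY_TO_STANDARD):
    """Replace every legacy code point found in the table with its standard one."""
    return text.translate(table)


def unmapped_private_use(text, table=LEGACY_TO_STANDARD):
    """List private use code points in text that the table does not cover."""
    return sorted(
        {
            f"U+{ord(ch):04X}"
            for ch in text
            if unicodedata.category(ch) == "Co" and ord(ch) not in table
        }
    )


message = "so proud \U000fe328"
print(normalize_emoji(message) == "so proud \U0001F624")  # True
print(unmapped_private_use("\U000fe328\U000fe329"))       # ['U+FE329']
```

`str.translate` takes a dict keyed by code point, so the whole message history can be converted in one pass, and the second helper shows which characters still have no standard equivalent in your table.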

There's php-emoji on GitHub, which appears to contain these mappings. Note, however, that this is PHP code and the characters are represented as UTF-8 byte sequences (e.g. the character above would be "\xf3\xbe\x8c\xa8").
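If you copy byte sequences out of a PHP table like that, decoding them as UTF-8 gets you back to code points. A minimal sketch, assuming you have extracted (legacy, standard) byte pairs yourself: `UTF8_PAIRS` and `build_translate_table` are hypothetical names, the first byte string is the one quoted above, the second is simply the UTF-8 form of U+1F624, and pairing the two is an assumption for illustration.

```python
# Hypothetical (legacy bytes, standard bytes) pairs in the style of a PHP table.
# Pairing these two sequences is an assumption, not copied from php-emoji.
UTF8_PAIRS = [
    (b"\xf3\xbe\x8c\xa8", b"\xf0\x9f\x98\xa4"),
]


def build_translate_table(pairs):
    """Turn UTF-8 byte pairs into a dict usable with str.translate."""
    table = {}
    for legacy_bytes, standard_bytes in pairs:
        legacy = legacy_bytes.decode("utf-8")
        standard = standard_bytes.decode("utf-8")
        # str.translate keys on single code points, so skip longer sequences.
        if len(legacy) == 1:
            table[ord(legacy)] = standard
    return table


table = build_translate_table(UTF8_PAIRS)
print(b"\xf3\xbe\x8c\xa8".decode("utf-8") == "\U000fe328")  # True
print("\U000fe328".translate(table) == "\U0001F624")        # True
```

The values are kept as strings rather than integers because some standard emoji are sequences of more than one code point, and `str.translate` accepts string replacements.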

answered Sep 20 '22 03:09

roeland