
Twitter emoji encoding problems with twitteR and R

I'm trying to build a way to find emojis in tweets and relate them to the Unicode table published on unicode.org, but I'm finding it hard to identify them because of what I think are encoding problems, or simply my misunderstanding of the topic. In short, I built a "library" of emojis from the table at http://www.unicode.org/emoji/charts/full-emoji-list.html that contains the title and the code point (code) of each emoji. I scraped this in R with the rvest library.

The problem comes when I grab the information from Twitter with the twitteR package in R: the codes for the emojis do not look at all like the ones in this table.

Take the emoji of the red 100 (hundred points) icon as an example. It is number 1468 in the table linked above, and its code point is:

U+1F4AF

Now, when I grab it from Twitter, it is first shown like this in the status class that the API has built in to work with tweets:

\xed\xa0\xbd\xed\xb2\xaf

Then I convert it to a data frame, also with a built-in function from the twitteR API. For example:

tweet$toDataFrame()

The emoji becomes this:

<ed><U+00A0><U+00BD><ed><U+00B2><U+00AF>

I tried to convert it with the function iconv in R, with the following code:

iconv(tweet$text, from="UTF-8", to="ASCII", sub="byte")

and I only manage to make it look like this:

<ed><a0><bd><ed><b2><af>

To wrap up, my tests produced the following three results:

<ed><a0><bd><ed><b2><af>
<ed><U+00A0><U+00BD><ed><U+00B2><U+00AF>
\xed\xa0\xbd\xed\xb2\xaf

None of them looks like the code point given in the table:

U+1F4AF

Is there any way to convert between the two strings? What am I missing? Why is Twitter returning this information for emojis?

Ed. asked Jun 23 '16 19:06



1 Answer

I didn't know anything about encoding before, but after days of reading I think I know what is going on. I don't understand perfectly how the encoding for emoji works, but I stumbled upon the same problem and solved it.

You want to map \xed\xa0\xbd\xed\xb2\xaf to its name: hundred points. A sensible way is to scrape a dictionary online and use a key, such as the Unicode code point (here U+1F4AF), to do the replacement. The conversions you show are not different encodings but different notations for the same encoded emoji:

  1. as.data.frame(tweet) returns <ed><U+00A0><U+00BD><ed><U+00B2><U+00AF>.
  2. iconv(tweet, from="UTF-8", to="ASCII", "byte") returns <ed><a0><bd><ed><b2><af>.
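
To make the "same bytes, different notation" point concrete, here is a minimal sketch in base R. The helper names bytes_to_tags and tags_to_bytes are hypothetical (they are not part of twitteR or base R); the sketch only assumes that the tweet contains the six bytes shown in the question.

```r
# Hypothetical helpers, for illustration only.

# Render raw bytes in the <xx> notation that iconv(..., sub = "byte") produces.
bytes_to_tags <- function(bytes) {
  paste0("<", as.character(bytes), ">", collapse = "")
}

# Parse the <xx> notation back into a raw vector.
tags_to_bytes <- function(tags) {
  hex <- regmatches(tags, gregexpr("(?<=<)[0-9a-f]{2}(?=>)", tags, perl = TRUE))[[1]]
  as.raw(strtoi(hex, base = 16L))
}

# The six bytes from the question, written as a raw vector.
emoji_bytes <- as.raw(c(0xed, 0xa0, 0xbd, 0xed, 0xb2, 0xaf))

emoji_tags <- bytes_to_tags(emoji_bytes)
print(emoji_tags)  # "<ed><a0><bd><ed><b2><af>"

# Going back from the notation recovers exactly the same bytes.
round_trip <- tags_to_bytes(emoji_tags)
print(identical(round_trip, emoji_bytes))  # TRUE
```

Working on the raw bytes rather than on the printed string avoids depending on how a particular console chooses to display them.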

So matching on the Unicode code point directly isn't feasible, because none of these notations contains it. Another way is to use a dictionary that already encodes emoji in the <ed>...<ed>... way, like the one here: emoji list. Voilà! The only catch is that her list is incomplete, because it comes from a dictionary that contains fewer emoticons.

The fast solution is to simply scrape a more complete dictionary and map each <ed>...<ed>... sequence to its corresponding English text. I have done that already and posted it here.

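Still, it bugged me that nobody else had posted a list with this encoding. In fact, most dictionaries I found encode the emoji in UTF-8 not as <ed>...<ed>... but as <f0>.... It turns out both byte sequences stand for the same code point U+1F4AF; the bytes are just produced differently. Strictly speaking, only the <f0>... form is well-formed UTF-8: the <ed>...<ed>... form encodes each half of a UTF-16 surrogate pair as its own three-byte sequence (a scheme known as CESU-8), which a strict UTF-8 decoder rejects.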

Long answer. The tweet is read in UTF-16 and then converted to UTF-8, and this is where the conversions diverge. When the conversion is done one 16-bit unit (pair of bytes) at a time, each surrogate is encoded on its own and the result is <ed>...<ed>...; when the two surrogates (four bytes) are first combined into a single code point, the result is <f0>.... (Which one you get depends on how the software doing the conversion handles surrogate pairs, not on the content of the tweet.)
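
The two conversion paths can be reproduced by hand. The sketch below is an illustration in base R, not what twitteR or iconv do internally; encode_unit is a hypothetical helper that applies the UTF-8 bit layout to a number without checking whether that number is a surrogate.

```r
# Hypothetical helper: apply the UTF-8 bit layout to one number.
# It deliberately does NOT reject surrogates (0xD800-0xDFFF).
encode_unit <- function(x) {
  if (x < 0x80) {
    as.raw(x)
  } else if (x < 0x800) {
    as.raw(c(0xC0 + x %/% 64, 0x80 + x %% 64))
  } else if (x < 0x10000) {
    as.raw(c(0xE0 + x %/% 4096, 0x80 + (x %/% 64) %% 64, 0x80 + x %% 64))
  } else {
    as.raw(c(0xF0 + x %/% 262144, 0x80 + (x %/% 4096) %% 64,
             0x80 + (x %/% 64) %% 64, 0x80 + x %% 64))
  }
}

cp <- 0x1F4AF

# Path 1: encode the whole code point at once (well-formed UTF-8).
four_byte <- encode_unit(cp)
print(four_byte)        # f0 9f 92 af

# Path 2: split into UTF-16 surrogates first, then encode each one separately.
hi <- 0xD800 + (cp - 0x10000) %/% 0x400   # high surrogate, 0xD83D
lo <- 0xDC00 + (cp - 0x10000) %% 0x400    # low surrogate, 0xDCAF
surrogate_bytes <- c(encode_unit(hi), encode_unit(lo))
print(surrogate_bytes)  # ed a0 bd ed b2 af
```

Both results start from the same number; the only difference is whether the surrogate split happens before the bytes are written.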

So a slower (but more principled) way to solve your problem is to scrape the <f0>... dictionary, convert it to UTF-16, and convert it back to UTF-8 one 16-bit unit at a time; you'll end up with two <ed>... sequences. These two sequences are the high-low surrogate pair representation of the Unicode code point U+xxxxx.
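
Going the other way is often simpler: instead of converting the dictionary, the <ed>...<ed>... bytes from the tweet can be turned back into a code point and looked up in the unicode.org table. A minimal sketch, assuming exactly six bytes that hold one surrogate pair; decode_unit and surrogate_bytes_to_codepoint are hypothetical names.

```r
# Hypothetical helper: undo the three-byte layout to recover one 16-bit unit.
decode_unit <- function(b) {
  b <- as.integer(b)
  (b[1] - 0xE0) * 4096 + (b[2] - 0x80) * 64 + (b[3] - 0x80)
}

# Hypothetical helper: six bytes (two encoded surrogates) -> one code point.
surrogate_bytes_to_codepoint <- function(bytes) {
  stopifnot(length(bytes) == 6)
  hi <- decode_unit(bytes[1:3])
  lo <- decode_unit(bytes[4:6])
  # Check that the two units really are a high and a low surrogate.
  stopifnot(hi >= 0xD800, hi <= 0xDBFF, lo >= 0xDC00, lo <= 0xDFFF)
  (hi - 0xD800) * 0x400 + (lo - 0xDC00) + 0x10000
}

cp <- surrogate_bytes_to_codepoint(as.raw(c(0xed, 0xa0, 0xbd, 0xed, 0xb2, 0xaf)))
label <- sprintf("U+%X", as.integer(cp))
print(label)  # "U+1F4AF"
```

The resulting label has the same form as the code column of the scraped table, so it can serve directly as a join key.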

As an example:

unicode <- 0x1F4Af

# Multibyte Version
intToUtf8(unicode)

# Byte-pair Version
hilo <- unicode2hilo(unicode)
intToUtf8(hilo)

Returns:

[1] "\xf0\x9f\x92\xaf"
[1] "\xed\xa0\xbd\xed\xb2\xaf"

Which, again, using iconv(..., 'utf-8', 'latin1', 'byte'), is the same as:

[1] "<f0><9f><92><af>"
[1] "<ed><a0><bd><ed><b2><af>"

PS1.: The function unicode2hilo is a simple linear transformation from a Unicode code point to its high-low surrogate pair, and hilo2unicode is its inverse:

# Code point (above U+FFFF) -> high and low surrogates, as hex strings
unicode2hilo <- function(unicode) {
   hi <- floor((unicode - 0x10000) / 0x400) + 0xd800
   lo <- (unicode - 0x10000) + 0xdc00 - (hi - 0xd800) * 0x400
   hilo <- paste('0x', as.hexmode(c(hi, lo)), sep = '')
   return(hilo)
}

# High and low surrogates -> code point, as a hex string
hilo2unicode <- function(hi, lo) {
   unicode <- (hi - 0xD800) * 0x400 + lo - 0xDC00 + 0x10000
   unicode <- paste('0x', as.hexmode(unicode), sep = '')
   return(unicode)
}

PS2.: I would recommend using iconv(tweet, 'UTF-8', 'latin1', 'byte') to preserve special characters like áäà.

PS3.: To replace the emoji with its English text, tag, hash, or anything else you want to map it to, I would suggest using DFS in a graph of emojis, because some emojis are encoded as the concatenation of other, simpler code points. For example, <f0><9f><a4><b8><e2><80><8d><e2><99><82><ef><b8><8f> is man cartwheeling, while taken independently <f0><9f><a4><b8> is person cartwheeling, <e2><80><8d> is the invisible zero-width joiner, <e2><99><82> is the male sign, and <ef><b8><8f> is the invisible variation selector-16. While man cartwheeling and person cartwheeling male sign are obviously semantically related, I prefer the more faithful translation.
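
As a simpler alternative to a graph search, a greedy longest-match-first replacement also gives the more faithful translation for this example. This is a swapped-in technique, not the DFS described above; the dictionary contents and the function name replace_emoji are made up for illustration, and the sketch assumes the text has already been passed through iconv(..., sub="byte").

```r
# Illustrative dictionary keyed by the <xx> byte notation (hypothetical subset).
emoji_dict <- c(
  "<f0><9f><a4><b8><e2><80><8d><e2><99><82><ef><b8><8f>" = "man cartwheeling",
  "<f0><9f><a4><b8>" = "person cartwheeling",
  "<e2><99><82>" = "male sign",
  "<f0><9f><92><af>" = "hundred points"
)

# Hypothetical helper: replace keys longest first, so that a composed
# sequence is matched before any of its components.
replace_emoji <- function(text, dict) {
  keys <- names(dict)[order(nchar(names(dict)), decreasing = TRUE)]
  for (key in keys) {
    text <- gsub(key, paste0("[", dict[[key]], "]"), text, fixed = TRUE)
  }
  text
}

tweet_text <- paste0(
  "so good <f0><9f><92><af> ",
  "<f0><9f><a4><b8><e2><80><8d><e2><99><82><ef><b8><8f>"
)
translated <- replace_emoji(tweet_text, emoji_dict)
print(translated)  # "so good [hundred points] [man cartwheeling]"
```

Because the longest key is tried first, the composed sequence is never broken into person cartwheeling plus male sign.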

Felipe Suárez Colmenares answered Nov 15 '22 04:11