
Emojis to/from code points in JavaScript

In a hybrid Android/Cordova game that I am creating I let users provide an identifier in the form of an Emoji + an alphanumeric - i.e. 0..9,A..Z,a..z - name. For example

🙋‍️Stackoverflow

Server-side, the user identifiers are stored with the Emoji and Name parts separated, with only the Name part required to be unique. From time to time the game displays a "league table" so the user can see how well they are performing compared to other players. For this purpose the server sends back a sequence of ten "high score" values consisting of Emoji, Name and Score.

This is then presented to the user in a table with three columns, one each for Emoji, Name and Score. And this is where I have hit a slight problem. Initially I had quite naively assumed that I could figure out the Emoji by simply looking at handle.codePointAt(0). When it dawned on me that an Emoji could in fact be a sequence of one or more 16-bit Unicode values, I changed my code as follows:

Part I - Dissecting the user-supplied "handle"

var i = 0, username,
    codepoints = [], 
    handle = "🙋‍️StackOverflow",
    len = handle.length;

 while ((i < len) && (255 < handle.codePointAt(i))) 
 {codepoints.push(handle.codePointAt(i));i += 2;}

 username = handle.substring(codepoints.length + 1);

At this point I have the "dissected" handle with

 codepoints = [128587, 8205, 65039];
 username = 'StackOverflow';

A note of explanation for the i += 2 and the use of handle.length above. This article suggests that

  • handle.codePointAt(n) will return the code point for the full surrogate pair if you hit the leading surrogate. In my case, since the Emoji has to be the first character, the leading surrogates for the sequence of 16-bit code units for the emoji are at 0,2,4....
  • From the same article I learnt that String.length in JavaScript will return the number of 16-bit code units.
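
The two points above can be checked directly. The following is a minimal sketch, not code taken from the article; the helper name codePointsOf is hypothetical. It shows that length counts 16-bit code units, and that a walk over code points should advance by 2 only when the current code point is above 0xFFFF, and by 1 otherwise.

```javascript
// Hypothetical helper: walk a string by code points rather than by a fixed step.
function codePointsOf(str) {
    var out = [], i = 0, cp;
    while (i < str.length) {
        cp = str.codePointAt(i);
        out.push(cp);
        // Code points above 0xFFFF occupy two code units (a surrogate pair);
        // everything else occupies a single code unit.
        i += cp > 0xFFFF ? 2 : 1;
    }
    return out;
}

var bow = "\u{1F647}";   // one code point, stored as a surrogate pair
var zwj = "\u200D";      // ZERO WIDTH JOINER, a single code unit

console.log(bow.length);            // 2
console.log(zwj.length);            // 1
console.log(bow.codePointAt(0));    // 128583 (the full code point)
console.log(bow.codePointAt(1));    // 56903 (the trailing surrogate on its own)
console.log(codePointsOf(bow + zwj + "A"));   // [128583, 8205, 65]
```

So the leading surrogates are only guaranteed to sit at even indices when every element of the emoji is itself a surrogate pair.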

Part II - Regenerating the Emojis for the "league table"

Suppose the league table data squirted back to the app by my servers has the entry {emoji: [128583, 8205, 65039],username:"Stackexchange",points:100} for the emoji character 🙇‍️. Now here is the bothersome thing. If I do

var origCP = [],
    i = 0, 
    origEmoji = '🙇‍️',
    origLen = origEmoji.length;

    while ((i < origLen) && (255 < origEmoji.codePointAt(i)))
    {origCP.push(origEmoji.codePointAt(i)); i += 2;}

I get

 origLen = 5, origCP = [128583, 8205, 65039]

However, if I regenerate the emoji from the provided data

 var reEmoji = String.fromCodePoint.apply(String,[128583, 8205, 65039]),
     reEmojiLen = reEmoji.length;

I get

reEmoji = '🙇‍️' 
reEmojiLen = 4;

So while reEmoji has the correct emoji its reported length has mysteriously shrunk down to 4 code units in place of the original 5.

If I then extract code points from the regenerated emoji

var reCP = [],
    i = 0;

while ((i < reEmojiLen) && (255 < reEmoji.codePointAt(i)))
{reCP.push(reEmoji.codePointAt(i)); i += 2;}

I get

 reCP =  [128583, 8205];

Even curiouser, origEmoji.codePointAt(3) gives the trailing surrogate pair value of 9794 while reEmoji.codePointAt(3) gives the value of the next full surrogate pair 65039.
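
One possible reading of these numbers, stated as an assumption rather than something confirmed in the post: 9794 is U+2642 (the male sign), so the original emoji would be the four code points U+1F647 U+200D U+2642 U+FE0F, which occupy five code units. Under that assumption a fixed i += 2 step lands on indices 0, 2 and 4 and never visits index 3, so U+2642 is silently dropped. The sketch below spells the string out with escapes so that nothing depends on how the emoji was pasted; the names assumedOrig and fixedStep are made up for illustration.

```javascript
// ASSUMPTION: the original emoji is U+1F647 U+200D U+2642 U+FE0F.
// This is consistent with the reported length of 5 and codePointAt(3) === 9794.
var assumedOrig = "\u{1F647}\u200D\u2642\uFE0F";

// The fixed-step loop from the question, same logic, wrapped in a function.
function fixedStep(str) {
    var cps = [], i = 0;
    while (i < str.length && 255 < str.codePointAt(i)) {
        cps.push(str.codePointAt(i));
        i += 2;
    }
    return cps;
}

var extracted = fixedStep(assumedOrig);
console.log(assumedOrig.length);   // 5
console.log(extracted);            // [128583, 8205, 65039] (index 3 is never read)

// Rebuilding from the incomplete list gives a shorter, different string.
var rebuilt = String.fromCodePoint.apply(String, extracted);
console.log(rebuilt.length);           // 4
console.log(rebuilt.codePointAt(3));   // 65039
console.log(fixedStep(rebuilt));       // [128583, 8205]

// Iterating by code point keeps all four values.
var allPoints = Array.from(assumedOrig, function (c) { return c.codePointAt(0); });
console.log(allPoints);                // [128583, 8205, 9794, 65039]
```

If the assumption holds, the regenerated string is not the same emoji with a different length; it is a different sequence that happens to render similarly.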

I could at this point just say

Do I really care?

After all, I just want to show the league table emojis in a separate column, so as long as I am getting the right emoji the niceties of what is happening under the hood do not matter. However, this might well be storing up problems for the future.

Can anyone here shed any light on what is happening?

asked Nov 04 '19 10:11 by DroidOS


1 Answer

Emojis are more complicated than single chars: they come in "sequences", e.g. a ZWJ sequence (several emojis combined into one image) or a presentation sequence (different variations of the same symbol), and some more; see UTS #51 (tr51) for all the nasty details.
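
To make the two kinds of sequence concrete, here is a small sketch using escapes only. The helper name hexPoints is made up for illustration, and the two samples (a heart followed by a variation selector, and man + ZWJ + laptop) are chosen just as examples.

```javascript
// Hypothetical helper: list the code points of a string in hex.
const hexPoints = s => [...s].map(c => c.codePointAt(0).toString(16));

// Presentation sequence: base character + VARIATION SELECTOR-16 (U+FE0F)
const heart = "\u2764\uFE0F";
console.log(hexPoints(heart).join(" "), heart.length);
// 2764 fe0f 2

// ZWJ sequence: two emoji joined by ZERO WIDTH JOINER (U+200D)
const technologist = "\u{1F468}\u200D\u{1F4BB}";
console.log(hexPoints(technologist).join(" "), technologist.length);
// 1f468 200d 1f4bb 5
```

In both cases the number of code points, the number of code units and the number of visible glyphs are three different things.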

If you "dump" your string like this

str = "🙋‍️StackOverflow"

console.log(...[...str].map(x => x.codePointAt(0).toString(16)))

you'll see that it's actually an (incorrectly formed) zwj-sequence wrapped in a presentation sequence.

So, to slice emojis accurately, you need to iterate the string as an array of code points (not code units!) and extract supplementary-plane CPs (> 0xffff) + ZWJs + variation selectors. Example:

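function sliceEmoji(str) {
    // res[0] collects the emoji part, res[1] everything else
    let res = ['', ''];

    // for...of iterates by code point, so surrogate pairs stay intact
    for (let c of str) {
        let n = c.codePointAt(0);
        // supplementary planes, ZWJ, or a variation selector (U+FE00..U+FE0F);
        // emoji that live in the BMP (e.g. U+2642) are not matched by this test
        let isEmoji = n > 0xffff || n === 0x200d || (0xfe00 <= n && n <= 0xfe0f);
        res[isEmoji ? 0 : 1] += c;
    }
    return res;
}

function hex(str) {
    return [...str].map(x => x.codePointAt(0).toString(16))
}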

myStr = "🙋‍️StackOverflow"

console.log(sliceEmoji(myStr))
console.log(sliceEmoji(myStr).map(hex))
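
For the server round trip described in the question (emoji stored as an array of code points), a sketch along the following lines may be enough. The function names toCodePoints and fromCodePoints are hypothetical and not part of any library; they rely only on string iteration and String.fromCodePoint.

```javascript
// Hypothetical helpers for storing an emoji as an array of code points.
function toCodePoints(str) {
    // Spreading a string iterates by code point, so surrogate pairs stay together.
    return [...str].map(c => c.codePointAt(0));
}

function fromCodePoints(cps) {
    return String.fromCodePoint(...cps);
}

// The three code points reported in the question, written as escapes.
const emoji = "\u{1F64B}\u200D\uFE0F";
const stored = toCodePoints(emoji);

console.log(stored);                             // [128587, 8205, 65039]
console.log(fromCodePoints(stored) === emoji);   // true
console.log(fromCodePoints(stored).length);      // 4
```

Because every code point is visited exactly once, the rebuilt string is identical to the input, including its length in code units.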

answered Sep 17 '22 23:09 by georg