
Splitting emoji, safely

I'm attempting to split a string into single words/chars, but I'm having trouble when it comes to emoji.

First of all, I can't simply split the string on an empty string, because emoji generally have a length >= 2.

"😎".split("")
["�", "�"]

I found an emoji regex that mostly works, but now I'm seeing some strange flesh-colored blocks. I even see them show up on Twitter in some cases.


Here's a pen that illustrates the problem with the fleshy blocks http://codepen.io/positlabs/pen/QyEOEG?editors=011


UPDATE -----------

Trying out spliddit, and I'm still seeing the issue with the skin tone characters. Is there some way to glue them back together?

http://codepen.io/positlabs/pen/rxLqwL?editors=001

posit labs asked Dec 22 '15 18:12



2 Answers

JavaScript's strings are UTF-16, so your emoji is internally represented as two code units:

> "\ud83d\ude0e" === "😎"
true

String.prototype.split doesn't know about UTF-16 surrogate pairs, so split("") naively cuts the string into individual code units and breaks your emoji. The same goes for length, charAt and bracket indexing: they all count code units, not characters.
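To make the difference concrete, here is a small sketch comparing the code-unit view with the code-point view. It assumes an ES2015+ runtime, where the string iterator (used by Array.from, spread and for...of) walks code points instead of code units:

```javascript
const emoji = "\u{1F60E}"; // 😎, the same string as "\ud83d\ude0e"

// Code-unit view: this is what split("") and .length see.
const codeUnits = emoji.split("");
console.log(emoji.length);     // 2
console.log(codeUnits.length); // 2 (two lone surrogates)

// Code-point view: the string iterator keeps the surrogate pair together.
const codePoints = Array.from(emoji);
console.log(codePoints.length);                 // 1
console.log(codePoints[0] === emoji);           // true
console.log(emoji.codePointAt(0).toString(16)); // "1f60e"
```

Note that this only keeps surrogate pairs together. Anything built from several code points, such as an emoji followed by a skin tone modifier, still comes apart.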

There's no easy way to deal with it. You need a library like spliddit to handle the surrogate pairs properly.
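As for gluing the skin tone characters back together: one possible approach is to split by code point and then append any skin tone modifier to the element before it. This is only a rough sketch, assuming the five modifiers U+1F3FB to U+1F3FF are the only sequences you care about. splitKeepingSkinTones is a made-up name for illustration, not something spliddit provides:

```javascript
// Hypothetical helper (not part of spliddit or any other library).
// Splits by code point, then re-attaches skin tone modifiers
// (U+1F3FB..U+1F3FF) to the element that precedes them.
function splitKeepingSkinTones(text) {
  const isSkinTone = (cp) => cp >= 0x1f3fb && cp <= 0x1f3ff;
  const out = [];
  for (const ch of text) { // for...of iterates by code point
    if (out.length > 0 && isSkinTone(ch.codePointAt(0))) {
      out[out.length - 1] += ch; // glue the modifier onto its base
    } else {
      out.push(ch);
    }
  }
  return out;
}

// "a", thumbs up (U+1F44D) with medium skin tone (U+1F3FD), "b"
console.log(splitKeepingSkinTones("a\u{1F44D}\u{1F3FD}b")); // [ 'a', '👍🏽', 'b' ]
```

It deliberately ignores other multi-code-point sequences (flags, ZWJ sequences, combining marks), so treat it as a patch for this one symptom rather than a general splitter.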

I'm not 100% familiar with the terminology, so please edit my answer as needed.

Blender answered Nov 13 '22 15:11



spliddit currently can't correctly split, for example, this Hindi text into its 5 characters: "अनुच्छेद"

You need the grapheme-splitter library: https://github.com/orling/grapheme-splitter. It is a full implementation of the grapheme cluster rules in the Unicode UAX-29 standard and will split even the most exotic letters; emoji are just one of many use cases.
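For comparison, newer runtimes also ship a built-in grapheme segmenter, Intl.Segmenter, which follows the same UAX-29 rules. This is not the grapheme-splitter API, just a hedged sketch that assumes your environment provides Intl.Segmenter (recent Node.js and current browsers do). The splitGraphemes wrapper and its default locale are my own choices for illustration:

```javascript
// Assumes Intl.Segmenter is available in the runtime.
// splitGraphemes is a hypothetical wrapper name, not a library function.
function splitGraphemes(text, locale = "en") {
  const segmenter = new Intl.Segmenter(locale, { granularity: "grapheme" });
  // segment() returns an iterable of { segment, index, input } objects.
  return Array.from(segmenter.segment(text), (part) => part.segment);
}

// Thumbs up (U+1F44D) plus medium skin tone (U+1F3FD) stays in one piece.
console.log(splitGraphemes("a\u{1F44D}\u{1F3FD}b").length); // 3

// "e" followed by a combining acute accent is a single grapheme.
console.log(splitGraphemes("e\u0301").length); // 1
```

How Indic text such as the Hindi word above is clustered may differ between runtimes, since it depends on the Unicode version their segmentation data is built from.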

Orlin Georgiev answered Nov 13 '22 15:11
