Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why do Unicode emoji property escapes match numbers?

I found this awesome way to detect emojis using a regex that doesn't use "huge magic ranges" by using a Unicode property escape:

console.log(/\p{Emoji}/u.test('flowers 🌼🌺🌸')) // true
console.log(/\p{Emoji}/u.test('flowers')) // false

But when I shared this knowledge in this answer, @Bronzdragon noticed that \p{Emoji} also matches numbers! Why is that? Numbers are not emojis?

console.log(/\p{Emoji}/u.test('flowers 123')) // unexpectdly true

// regex-only workaround by @Bonzdragon
const regex = /(?=\p{Emoji})(?!\p{Number})/u;
console.log(
  regex.test('flowers'), // false, as expected
  regex.test('flowers 123'), // false, as expected
  regex.test('flowers 123 🌼🌺🌸'), // true, as expected
  regex.test('flowers 🌼🌺🌸'), // true, as expected
)

// more readable workaround
const hasEmoji = str => {
  const nbEmojiOrNumber = (str.match(/\p{Emoji}/gu) || []).length;
  const nbNumber = (str.match(/\p{Number}/gu) || []).length;
  return nbEmojiOrNumber > nbNumber;
}
console.log(
  hasEmoji('flowers'), // false, as expected
  hasEmoji('flowers 123'), // false, as expected
  hasEmoji('flowers 123 🌼🌺🌸'), // true, as expected
  hasEmoji('flowers 🌼🌺🌸'), // true, as expected
)
like image 384
Nino Filiu Avatar asked Oct 16 '20 12:10

Nino Filiu


People also ask

Can I use Unicode property escapes?

It is possible to use both short or long forms in Unicode property escapes. They can be used to match letters, numbers, symbols, punctuations, spaces, etc.

Do Emojis have Unicode values?

Because emoji characters are treated as pictographs, they are encoded in Unicode based primarily on their general appearance, not on an intended semantic. The meaning of each emoji can vary depending on language, culture, context, and may change or be repurposed by various groups over time.

Does regex work with Emojis?

emoji-regex offers a regular expression to match all emoji symbols and sequences (including textual representations of emoji) as per the Unicode Standard.

Is utf8 an emoji?

Emojis look like images, or icons, but they are not. They are letters (characters) from the UTF-8 (Unicode) character set.


1 Answers

According to this post, digtis, #, *, ZWJ and some more chars contain the Emoji property set to Yes, which means digits are considered valid emoji chars:

0023          ; Emoji_Component      #  1.1  [1] (#️)       number sign
002A          ; Emoji_Component      #  1.1  [1] (*️)       asterisk
0030..0039    ; Emoji_Component      #  1.1 [10] (0️..9️)    digit zero..digit nine
200D          ; Emoji_Component      #  1.1  [1] (‍)        zero width joiner
20E3          ; Emoji_Component      #  3.0  [1] (⃣)       combining enclosing keycap
FE0F          ; Emoji_Component      #  3.2  [1] ()        VARIATION SELECTOR-16
1F1E6..1F1FF  ; Emoji_Component      #  6.0 [26] (🇦..🇿)    regional indicator symbol letter a..regional indicator symbol letter z
1F3FB..1F3FF  ; Emoji_Component      #  8.0  [5] (🏻..🏿)    light skin tone..dark skin tone
1F9B0..1F9B3  ; Emoji_Component      # 11.0  [4] (🦰..🦳)    red-haired..white-haired
E0020..E007F  ; Emoji_Component      #  3.1 [96] (󠀠..󠁿)      tag space..cancel tag

For example, 1 is a digit, but it becomes an emoji when combined with U+FE0F and U+20E3 chars: 1️⃣:

console.log("1\uFE0F\u20E3 2\uFE0F\u20E3 3\uFE0F\u20E3 4\uFE0F\u20E3 5\uFE0F\u20E3 6\uFE0F\u20E3 7\uFE0F\u20E3 8\uFE0F\u20E3 9\uFE0F\u20E3 0\uFE0F\u20E3")

If you want to avoid matching digits, use Extended_Pictographic Unicode category class:

The Extended_Pictographic characters contain all the Emoji characters except for some Emoji_Components.

So, you may use either /\p{Extended_Pictographic}/gu to most emojis proper, or /\p{Extended_Pictographic}/u to test for a single emoji proper, or use /[\p{Extended_Pictographic}\u{1F3FB}-\u{1F3FF}\u{1F9B0}-\u{1F9B3}]/u to match emojis proper and light skin to dark skin mode chars and red-haired to white-haired chars:

const regex_emoji = /[\p{Extended_Pictographic}\u{1F3FB}-\u{1F3FF}\u{1F9B0}-\u{1F9B3}]/u;
console.log( regex_emoji.test('flowers 123') );     // => false
console.log( regex_emoji.test('flowers 🌼🌺🌸') ); // => true
like image 54
Wiktor Stribiżew Avatar answered Oct 22 '22 17:10

Wiktor Stribiżew