
How can I match emoji with an R regex?

I want to determine which elements of my vector contain emoji:

x = c('😂', 'no', '🍹', '😀', 'no', '😛', '䨺', '감사')
x
# [1] "\U0001f602" "no"         "\U0001f379" "\U0001f600" "no"         "\U0001f61b" "䨺"         "감사"

Related posts only cover other languages, and because they mostly rely on specialized libraries, I couldn't figure out how to translate them to R:

  • What is the regex to extract all the emojis from a string?
  • How do I remove emoji from string
  • replace emoji unicode symbol using regexp in javascript
  • Regular expression matching emoji in Mac OS X / iOS
  • remove unicode emoji using re in python

The second looked very promising, but alas (not fixed by supplying perl = TRUE):

x[grepl('[\u{1F600}-\u{1F6FF}]', x)]

Error: invalid \u{xxxx} sequence (line 1)

Similar issues come about from other questions. How can we match emoji in R?

MichaelChirico asked Apr 12 '17 02:04



1 Answer

I convert the encoding to UTF-8 so that each element can be compared with the UTF-8 values of the emoji in the remoji library. I use the stringr library to find the positions of the emoji in the vector, but grep or any similar function would work as well.

1st Method:

library(stringr)
xvect = c('😂', 'no', '🍹', '😀', 'no', '😛')

Encoding(xvect) <- "UTF-8"

which(str_detect(xvect, "[^[:ascii:]]"))
# [1] 1 3 4 6

Here elements 1, 3, 4 and 6 are the emoji.

Edit:

2nd Method: Install the remoji package from GitHub with devtools, using the commands below. Since the elements have already been converted to UTF-8, they can be compared with the UTF-8 values of all the emoji in the remoji library. Use trimws to remove the whitespace around the library's values.

install.packages("devtools")

devtools::install_github("richfitz/remoji")
library(remoji)
emj <- emoji(list_emoji(), TRUE)
xvect %in% trimws(emj)

Output:

which(xvect %in% trimws(emj))
# [1] 1 3 4 6

Neither method is foolproof. The first assumes that the vector contains no non-ASCII characters other than emoji, and the second relies on the emoji list shipped with remoji. If a particular emoji is missing from the library, the last command yields FALSE for it instead of TRUE.
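The limitation of the first method shows up on the full vector from the question, which also holds a CJK character and a Korean word. The following is a minimal sketch in base R rather than stringr; it assumes that grepl with perl = TRUE and the [:ascii:] class is an acceptable stand-in for str_detect here, and it writes the characters as escapes so that the strings are marked as UTF-8 whatever the session locale is.

```r
# Same eight elements as in the question, written with Unicode escapes.
x <- c("\U0001F602", "no", "\U0001F379", "\U0001F600", "no", "\U0001F61B",
       "\u4a3a", "\uac10\uc0ac")

# Flag every element that contains at least one non-ASCII character.
non_ascii <- grepl("[^[:ascii:]]", x, perl = TRUE)

which(non_ascii)
# [1] 1 3 4 6 7 8
```

Elements 7 and 8 are flagged even though they are not emoji, so this test is only safe when emoji are the only non-ASCII content.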

Final Edit:

Following the discussion between the OP (@MichaelChirico) and @SymbolixAU, and thanks to both of them, the problem turns out to be a small typo: the escape needs a capital U. In R, lowercase \u accepts at most four hex digits, so code points above U+FFFF must be written with \U. The new regex is xvect[grepl('[\U{1F300}-\U{1F6FF}]', xvect)]. The range in the character class runs from U+1F300 to U+1F6FF. One can of course change or extend this range when an emoji lies outside it. The range is probably not a complete list, and the set of emoji code points keeps growing over time.
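As a rough sketch of the corrected regex applied to the full vector from the question: the characters are written as escapes so the strings are marked as UTF-8 regardless of locale, and the names emoji_re, wider_re and wider_hits are made up for this illustration. The wider pattern is only an assumption about which extra blocks one might add (Miscellaneous Symbols and Dingbats, U+2600 to U+27BF, and Supplemental Symbols and Pictographs, U+1F900 to U+1F9FF); it is not a complete emoji definition.

```r
# Same eight elements as in the question, written with Unicode escapes.
x <- c("\U0001F602", "no", "\U0001F379", "\U0001F600", "no", "\U0001F61B",
       "\u4a3a", "\uac10\uc0ac")

# Capital \U is required here: the code points need five hex digits.
emoji_re <- "[\U{1F300}-\U{1F6FF}]"

which(grepl(emoji_re, x))
# [1] 1 3 4 6

# Hypothetical wider pattern with two extra ranges (an assumption, not a
# complete list of emoji blocks).
wider_re <- "[\U{1F300}-\U{1F6FF}\U{1F900}-\U{1F9FF}\u2600-\u27BF]"

# U+1F923 and U+2764 fall outside the original range but inside the added ones.
wider_hits <- grepl(wider_re, c("\U0001F923", "\u2764", "no"))
```

Unlike the non-ASCII test, the range-based pattern leaves the CJK and Korean elements unflagged, at the cost of having to maintain the list of ranges by hand.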

PKumar answered Sep 22 '22 19:09