
How do I remove emoji from a string?

I need to remove emoji from a string using a regex, without removing CJK (Chinese, Japanese, Korean) characters. I tried this regex:

REGEX = /[^\u1F600-\u1F6FF\s]/i

This regex works fine, except that it also matches Chinese, Japanese, and Korean characters, which I need to keep. Any idea how to solve this issue?

kilua asked Jul 10 '14 09:07



2 Answers

Karol S already provided a solution, but the reason might not be clear:

"\u1F600" is actually "\u1F60" followed by "0":

"\u1F60"    # => "ὠ"
"\u1F600"   # => "ὠ0"

You have to use curly braces for code points above FFFF:

"\u{1F600}" #=> "😀"

Therefore the character class [\u1F600-\u1F6FF] is interpreted as [\u1F60 0-\u1F6F F], i.e. it matches "\u1F60", the range "0".."\u1F6F" and "F".
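
The parsing described above can be checked by inspecting code points. This is a small sketch; the variable names are chosen here for illustration, and Ruby may print a duplicated-range warning for the broken class when warnings are enabled.

```ruby
# Without braces, \u consumes exactly four hex digits; a fifth digit is a literal character.
short_escape  = "\u1F600"
braced_escape = "\u{1F600}"

short_codepoints  = short_escape.codepoints.map { |c| format("U+%04X", c) }
braced_codepoints = braced_escape.codepoints.map { |c| format("U+%04X", c) }

# broken_class is a hypothetical name for the class from the question,
# written here without the negation and without \s.
broken_class = /[\u1F600-\u1F6FF]/

# "A" (U+0041) lies inside the accidental range "0".."\u1F6F", so it matches.
matches_ascii_letter = !(broken_class =~ "A").nil?
# U+1F600 itself is outside every part of the class, so it does not match.
matches_emoji = !(broken_class =~ "\u{1F600}").nil?
```

Here `short_codepoints` holds two entries (U+1F60 and U+0030), while `braced_codepoints` holds the single code point U+1F600.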

Using curly braces solves the issue:

/[\u{1F600}-\u{1F6FF}]/

This matches (emoji) characters in these unicode blocks:

  • U+1F600..U+1F64F Emoticons
  • U+1F650..U+1F67F Ornamental Dingbats
  • U+1F680..U+1F6FF Transport and Map Symbols
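
Applied to the original question, the braced range can be passed to gsub to strip those characters while leaving CJK text alone. This is a minimal sketch, assuming the three blocks listed above are the only emoji of interest; remove_emoji is a name made up for this example. The sample string is written with \u escapes and stands for "Hi 😀 华み원 🚀".

```ruby
# remove_emoji is a hypothetical helper name, not part of the original answer.
def remove_emoji(text)
  # Covers U+1F600..U+1F6FF only; emoji in other blocks are left in place.
  text.gsub(/[\u{1F600}-\u{1F6FF}]/, "")
end

# "Hi", a grinning face, three CJK characters, and a rocket, separated by spaces.
sample  = "Hi \u{1F600} \u534E\u307F\uC6D0 \u{1F680}"
cleaned = remove_emoji(sample)
```

The CJK characters (U+534E, U+307F, U+C6D0) fall outside the range, so they survive; the spaces around the removed emoji are kept as well.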

You can also use unpack, pack, and between? to achieve a similar result. This also works for Ruby 1.8.7 which doesn't support Unicode in regular expressions.

s = 'Hi!😀'
#=> "Hi!\360\237\230\200"

s.unpack('U*').reject{ |e| e.between?(0x1F600, 0x1F6FF) }.pack('U*')
#=> "Hi!" 

Regarding your Rubular example – Emoji are single characters:

"😀".length  #=> 1
"😀".chars   #=> ["😀"]

Whereas kaomoji are a combination of multiple characters:

"^_^".length #=> 3
"^_^".chars  #=> ["^", "_", "^"]

Matching these is a very different task (and you should ask that in a separate question).

Stefan answered Nov 30 '22 23:11


I am using one based on this script.

def strip_emoji(text)
  # Treat the input as UTF-8, working on a copy so the caller's string is untouched
  clean = text.dup.force_encoding('utf-8').encode

  # symbols & pics
  clean = clean.gsub(/[\u{1f300}-\u{1f5ff}]/, "")

  # enclosed chars
  # I changed this to exclude chinese char
  clean = clean.gsub(/[\u{2500}-\u{2BEF}]/, "")

  # emoticons
  clean = clean.gsub(/[\u{1f600}-\u{1f64f}]/, "")

  # dingbats (this range already falls inside U+2500..U+2BEF above)
  clean.gsub(/[\u{2702}-\u{27b0}]/, "")
end

Results:

irb> strip_emoji("👽😀☂❀华み원❀")
=> "华み원"
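
Since the substitutions run one after another, the same result can probably be had with a single character class. This is a sketch rather than a drop-in replacement: strip_emoji_once is a hypothetical name, and the dingbats range is left out because U+2702..U+27B0 already lies inside U+2500..U+2BEF.

```ruby
# strip_emoji_once is a hypothetical name for a single-pass variant of strip_emoji.
def strip_emoji_once(text)
  # Three ranges in one class: symbols & pics, U+2500..U+2BEF, emoticons.
  pattern = /[\u{1F300}-\u{1F5FF}\u{2500}-\u{2BEF}\u{1F600}-\u{1F64F}]/
  # dup + force_encoding reads the bytes as UTF-8 without modifying the caller's string.
  text.dup.force_encoding("utf-8").gsub(pattern, "")
end

# Alien, grinning face, umbrella, florette, three CJK characters, florette.
input  = "\u{1F47D}\u{1F600}\u2602\u2740\u534E\u307F\uC6D0\u2740"
result = strip_emoji_once(input)
```

With this input, only the three CJK characters (U+534E, U+307F, U+C6D0) remain in `result`. One pass also avoids building an intermediate string for each range.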
jellene answered Dec 01 '22 00:12