How do I remove emoji from string

I want to remove emoji from a string using a regex, but keep CJK (Chinese, Japanese, Korean) characters. I tried this regex:

REGEX = /[^\u1F600-\u1F6FF\s]/i

This regex works fine, except that it also matches Chinese, Japanese and Korean characters, which I need to keep. Any idea how to solve this issue?

570

asked Jul 10 '14 09:07

kilua

2 Answers

Karol S already provided a solution, but the reason might not be clear:

"\u1F600" is actually "\u1F60" followed by "0":

"\u1F60"    # => "ὠ"
"\u1F600"   # => "ὠ0"

You have to use curly braces for code points above FFFF:

"\u{1F600}" #=> "😀"

Therefore the character class [\u1F600-\u1F6FF] is interpreted as [\u1F60 0-\u1F6F F], i.e. it matches "\u1F60", the range "0".."\u1F6F" and "F".
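
To see why CJK characters get caught, it can help to probe both character classes directly. This is only a small sketch, assuming Ruby 1.9 or later; the variable names `broken`, `negated` and `samples` are mine, not part of the question:

```ruby
# The class without curly braces: "\u1F60", the range "0".."\u1F6F", and "F"
broken = /[\u1F600-\u1F6FF]/

samples = ["A", "华", "😀"]

# "A" (U+0041) lies inside "0".."\u1F6F"; U+534E and U+1F600 do not
broken_hits = samples.map { |c| !!(c =~ broken) }
#=> [true, false, false]

# The negated class from the question matches whatever the class above
# misses, so CJK characters and emoji are matched alike
negated = /[^\u1F600-\u1F6FF\s]/i
negated_hits = samples.map { |c| !!(c =~ negated) }
#=> [false, true, true]
```

So the original regex never singled out emoji at all; it matched almost everything above U+1F6F, which includes the CJK blocks.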

Using curly braces solves the issue:

/[\u{1F600}-\u{1F6FF}]/

This matches (emoji) characters in these Unicode blocks:

U+1F600..U+1F64F Emoticons
U+1F650..U+1F67F Ornamental Dingbats
U+1F680..U+1F6FF Transport and Map Symbols
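
For the original goal (dropping emoji while keeping CJK text), the corrected class can be passed to `gsub`. A minimal sketch, assuming only the three blocks listed above need to go; the sample string and the variable names are made up for illustration:

```ruby
# Emoticons, Ornamental Dingbats, Transport and Map Symbols
emoji = /[\u{1F600}-\u{1F6FF}]/

text = "Hi!😀华み원🚀"

# Replace every character in the range with the empty string
cleaned = text.gsub(emoji, "")
#=> "Hi!华み원"
```

Emoji outside U+1F600..U+1F6FF are left alone by this pattern, so further ranges would have to be added to the class if those should be removed too.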

You can also use unpack, pack, and between? to achieve a similar result. This also works for Ruby 1.8.7, which doesn't support Unicode in regular expressions.

s = 'Hi!😀'
#=> "Hi!\360\237\230\200"

s.unpack('U*').reject{ |e| e.between?(0x1F600, 0x1F6FF) }.pack('U*')
#=> "Hi!"

Regarding your Rubular example – Emoji are single characters:

"😀".length  #=> 1
"😀".chars   #=> ["😀"]

Whereas kaomoji are a combination of multiple characters:

"^_^".length #=> 3
"^_^".chars  #=> ["^", "_", "^"]

Matching these is a very different task (and you should ask that in a separate question).

193

answered Nov 30 '22 23:11

Stefan

I am using one based on this script.

def strip_emoji(text)
  # work on a UTF-8 copy so the caller's string is left untouched
  text = text.dup.force_encoding('utf-8').encode

  # symbols & pics
  clean = text.gsub(/[\u{1f300}-\u{1f5ff}]/, "")

  # enclosed chars (I changed this range to exclude Chinese characters)
  clean = clean.gsub(/[\u{2500}-\u{2BEF}]/, "")

  # emoticons
  clean = clean.gsub(/[\u{1f600}-\u{1f64f}]/, "")

  # dingbats (already inside U+2500..U+2BEF above, kept from the original script)
  clean.gsub(/[\u{2702}-\u{27b0}]/, "")
end

Results:

irb> strip_emoji("👽😀☂❤华み원❤")
=> "华み원"
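
The separate passes could probably be merged into one character class, so the string is scanned only once. This is a sketch rather than the script the answer is based on; `strip_emoji_single_pass` is a hypothetical name, and the dingbats range is left out because U+2702..U+27B0 already falls inside U+2500..U+2BEF:

```ruby
# Hypothetical single-pass variant: the same ranges in one character class
def strip_emoji_single_pass(text)
  # symbols & pics, emoticons, and the U+2500..U+2BEF range (covers dingbats)
  emoji = /[\u{1f300}-\u{1f5ff}\u{1f600}-\u{1f64f}\u{2500}-\u{2BEF}]/
  text.gsub(emoji, "")
end

strip_emoji_single_pass("👽😀☂❤华み원❤")
#=> "华み원"
```

Since `gsub` returns a new string, this version does not touch the argument and needs no `force_encoding`, assuming the input is already valid UTF-8.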

answered Dec 01 '22 00:12

jellene
