My problem is to remove emoji from a string using a regex, without also removing CJK (Chinese, Japanese, Korean) characters. I tried this regex:
REGEX = /[^\u1F600-\u1F6FF\s]/i
This regex works fine except that it also matches the Chinese, Japanese and Korean characters, which I need to keep. Any idea how to solve this issue?
Karol S already provided a solution, but the reason might not be clear: "\u1F600" is actually "\u1F60" followed by "0":
"\u1F60" # => "ὠ"
"\u1F600" # => "ὠ0"
You have to use curly braces for code points above FFFF:
"\u{1F600}" #=> "😀"
Therefore the character class [\u1F600-\u1F6FF] is interpreted as [\u1F60 0-\u1F6F F], i.e. it matches "\u1F60", the range "0".."\u1F6F" and "F".
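A small sketch can make the difference visible. The variable names (broken, fixed, grinning, han) are only illustrative, and the sample characters are written as escapes so the snippet does not depend on the source file's encoding:

```ruby
# Four-digit form: parsed as "\u1F60", the range "0".."\u1F6F", and "F".
broken = /[\u1F600-\u1F6FF]/
# Brace form: a single range from U+1F600 to U+1F6FF.
fixed = /[\u{1F600}-\u{1F6FF}]/

grinning = "\u{1F600}" # GRINNING FACE emoji
han      = "\u{534E}"  # a CJK ideograph

puts "A".match?(broken)      # true: "A" lies inside "0".."\u1F6F"
puts grinning.match?(broken) # false: the emoji is not in the class at all
puts grinning.match?(fixed)  # true
puts han.match?(fixed)       # false
puts "A".match?(fixed)       # false
```

String#match? needs Ruby 2.4 or later; on older versions `=~` can be used instead.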
Using curly braces solves the issue:
/[\u{1F600}-\u{1F6FF}]/
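This matches (emoji) characters in these Unicode blocks:
U+1F600..U+1F64F Emoticons
U+1F650..U+1F67F Ornamental Dingbats
U+1F680..U+1F6FF Transport and Map Symbols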
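As a rough sketch of how the brace form could be applied to the original problem, the regex can be passed to gsub. The constant EMOJI_RANGE and the method remove_emoji are hypothetical names, and the range covers only U+1F600..U+1F6FF, so emoji outside that range are left in place:

```ruby
# Hypothetical constant name; the range itself is the one from the answer.
EMOJI_RANGE = /[\u{1F600}-\u{1F6FF}]/

# Returns a copy of str with every character in EMOJI_RANGE removed.
def remove_emoji(str)
  str.gsub(EMOJI_RANGE, "")
end

# "Hi", an emoji, then one Chinese, one Japanese and one Korean character.
text = "Hi \u{1F600} \u{534E}\u{307F}\u{C6D0}"
puts remove_emoji(text)
```

The CJK characters survive because their code points (U+534E, U+307F, U+C6D0) lie far outside the emoji range.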
You can also use unpack, pack, and between? to achieve a similar result. This also works for Ruby 1.8.7, which doesn't support Unicode in regular expressions.
s = 'Hi!😀'
#=> "Hi!\360\237\230\200"
s.unpack('U*').reject{ |e| e.between?(0x1F600, 0x1F6FF) }.pack('U*')
#=> "Hi!"
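The same idea can be extended to several ranges at once. The following is only a sketch: strip_code_points and RANGES are made-up names, the chosen ranges are an assumption, and the "\u{...}" escapes in the sample input need Ruby 1.9 or later even though the helper itself avoids Unicode regexes:

```ruby
# Hypothetical helper: drop every code point that falls in any given range.
def strip_code_points(str, ranges)
  kept = str.unpack('U*').reject do |cp|
    ranges.any? { |r| cp.between?(r.first, r.last) }
  end
  kept.pack('U*')
end

# Assumed ranges for illustration: pictographs plus the range from above.
RANGES = [0x1F300..0x1F5FF, 0x1F600..0x1F6FF]

s = "Hi!\u{1F600}\u{1F47D}"
puts strip_code_points(s, RANGES) # prints Hi!
```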
Regarding your Rubular example, emoji are single characters:
"😀".length #=> 1
"😀".chars #=> ["😀"]
Whereas kaomoji are a combination of multiple characters:
"^_^".length #=> 3
"^_^".chars #=> ["^", "_", "^"]
Matching these is a very different task (and you should ask that in a separate question).
I am using one based on this script.
def strip_emoji(text)
  # work on a UTF-8 copy so the caller's string is left untouched
  text = text.dup.force_encoding('utf-8').encode
  # symbols & pics
  regex = /[\u{1f300}-\u{1f5ff}]/
  clean = text.gsub regex, ""
  # enclosed chars
  regex = /[\u{2500}-\u{2BEF}]/ # I changed this to exclude Chinese chars
  clean = clean.gsub regex, ""
  # emoticons
  regex = /[\u{1f600}-\u{1f64f}]/
  clean = clean.gsub regex, ""
  # dingbats (already covered by the 2500-2BEF range above)
  regex = /[\u{2702}-\u{27b0}]/
  clean = clean.gsub regex, ""
  clean
end
Results:
irb> strip_emoji("👽😀☂❤华み원❤")
=> "华み원"
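If the separate passes are not needed, the same ranges could probably be merged into one character class. This is a sketch rather than part of the original script; EMOJI_AND_SYMBOLS and strip_emoji_single_pass are hypothetical names, and the dingbats range is left out because 2702-27b0 already falls inside 2500-2BEF:

```ruby
# Hypothetical single-regex variant of the ranges used in strip_emoji.
EMOJI_AND_SYMBOLS = /[\u{1F300}-\u{1F5FF}\u{1F600}-\u{1F64F}\u{2500}-\u{2BEF}]/

# Removes all matching characters in one gsub call.
def strip_emoji_single_pass(text)
  text.gsub(EMOJI_AND_SYMBOLS, "")
end

# Same input as the irb example, written with escapes:
# alien, grinning face, umbrella, heart, three CJK characters, heart.
input = "\u{1F47D}\u{1F600}\u{2602}\u{2764}\u{534E}\u{307F}\u{C6D0}\u{2764}"
puts strip_emoji_single_pass(input)
```

Because gsub returns a new string, this variant does not modify its argument.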