
Check if String Contains an Emoji in Ruby

In Ruby, here is how you can check for a substring in a string:

str = "hello world"
str.include?("lo")
=> true

When I attempt to save an emoji in a text column in a Rails application (the text column in the MySQL database is utf8), I get this error:

Incorrect string value: \xF0\x9F\x99\x82

For my situation in a Rails application, it suffices to check whether an emoji is present in the submitted text. If an emoji is present, raise a validation error. Example:

class MyModel < ApplicationRecord
  validate :cannot_contain_emojis

  private

  def cannot_contain_emojis
    if my_column.include?("/\xF0")
      errors.add(:my_column, 'Cannot include emojis')
    end
  end
end

Note: I am checking for \xF0 because, according to this site, all or most emojis appear to begin with this byte.

This, however, does not work. It keeps returning false even when an emoji is present. I'm pretty sure the issue is that my include? call doesn't work because the emoji is not converted to bytes for the comparison.

Question: How can I write a validation that checks that no emoji is passed in?

  • Example bytes for a smiley face in UTF8: \xF0\x9F\x99\x82
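
The byte sequence above can be inspected directly in Ruby. What follows is a minimal sketch rather than part of the original question: `contains_byte?` is a hypothetical helper name, and it compares raw bytes instead of characters, which is the step the `include?` call in the validation appears to skip.

```ruby
# The smiley from the error message, written as a code point escape.
smiley = "\u{1F642}"

# String#bytes exposes the UTF-8 encoding as an array of integers.
smiley_bytes = smiley.bytes # four bytes: 0xF0, 0x9F, 0x99, 0x82

# Hypothetical helper: true if any byte of the text equals the given byte.
def contains_byte?(text, byte)
  text.bytes.include?(byte)
end

contains_byte?(smiley, 0xF0)        # true
contains_byte?("hello world", 0xF0) # false
```

Note that 0xF0 is only one of the possible lead bytes of a four-byte UTF-8 sequence (0xF0 through 0xF4), so a check for this single byte would cover U+10000 to U+3FFFF and nothing above that.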
Neil asked Dec 17 '22 13:12


1 Answer

You can use the Emoji Unicode property to test for Emoji using a Regexp, something like this:

def cannot_contain_emojis
  if /\p{Emoji}/ =~ my_column
    errors.add(:my_column, 'Cannot include emojis')
  end 
end
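
As a rough illustration of how this property behaves outside a model, here is a small sketch. The helper name `contains_emoji?` and the constants are hypothetical. One caveat worth knowing: the Unicode Emoji property also covers the ASCII digits, `#` and `*`, so `\p{Emoji}` flags ordinary text such as "Room 101". The narrower `\p{Emoji_Presentation}` property (assuming your Ruby version's regexp engine supports it) avoids that, at the cost of missing symbols that default to text presentation.

```ruby
# Broad property: every code point with Emoji=Yes, including 0-9, # and *.
EMOJI = /\p{Emoji}/
# Narrower property: code points that render as emoji by default.
EMOJI_PRESENTATION = /\p{Emoji_Presentation}/

# Hypothetical helper: true if the text matches the given emoji pattern.
def contains_emoji?(text, pattern = EMOJI)
  pattern.match?(text)
end

contains_emoji?("hello \u{1F642}")              # true
contains_emoji?("hello world")                  # false
contains_emoji?("Room 101")                     # true, digits carry the Emoji property
contains_emoji?("Room 101", EMOJI_PRESENTATION) # false
```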

Unicode® Technical Standard #51 "UNICODE EMOJI" contains a more sophisticated regex:

\p{RI} \p{RI} 
| \p{Emoji} 
  ( \p{EMod} 
  | \x{FE0F} \x{20E3}? 
  | [\x{E0020}-\x{E007E}]+ \x{E007F} )?
  (\x{200D} \p{Emoji}
    ( \p{EMod} 
    | \x{FE0F} \x{20E3}? 
    | [\x{E0020}-\x{E007E}]+ \x{E007F} )?
  )*

[Note: some of those properties are not implemented in Onigmo / Ruby.]

However, checking for Emojis is probably not going to be enough. It is pretty clear that your text processing is broken at some point. And if it is broken by an Emoji, then there is a chance it will also be broken by my name, or the name of Ruby's creator 松本 行弘, or by the completely normal English word “naïve”.

Instead of playing a game of whack-a-mole trying to detect every Emoji, mathematical symbol, Arabic letter, typographically correct punctuation mark, etc., it would be much better to simply fix the text processing.
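
To make the whack-a-mole point concrete, here is a sketch under one assumption that the question does not confirm: that the column uses MySQL's `utf8` character set, which stores at most three bytes per character and would therefore reject any character whose UTF-8 encoding is four bytes long, emoji or not. The helper name `needs_four_bytes?` is hypothetical.

```ruby
# Hypothetical helper: true if any character needs more than three UTF-8 bytes.
def needs_four_bytes?(text)
  text.each_char.any? { |char| char.bytesize > 3 }
end

smiley = "\u{1F642}" # the emoji from the question
g_clef = "\u{1D11E}" # MUSICAL SYMBOL G CLEF, which is not an emoji

needs_four_bytes?(smiley)            # true
needs_four_bytes?(g_clef)            # true
g_clef.match?(/\p{Emoji}/)           # false, an emoji check misses it
needs_four_bytes?("naïve 松本 行弘") # false
```

If that assumption holds, an emoji-only validation would still let the G clef through to the database, and the more durable fix would be on the storage side, for example a character set that accepts four-byte characters such as `utf8mb4`.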

Jörg W Mittag answered Jan 04 '23 09:01