How can I detect certain Unicode characters in a string in Ruby?

Tags: character-encoding, ruby, encoding, unicode, cjk

Given a string in Ruby 1.8.7 (without the awesome Oniguruma regular expression engine that supports Unicode properties with \p{}), I would like to be able to determine if the string contains one or more Chinese, Japanese, or Korean characters. For example:

class String
  def contains_cjk?
    ...
  end
end

>> '日本語'.contains_cjk?
=> true
>> '광고 프로그램'.contains_cjk?
=> true
>> '艾弗森将退出篮坛'.contains_cjk?
=> true
>> 'Watashi ha bakana gaijin desu.'.contains_cjk?
=> false

I suspect that this will boil down to seeing if any of the characters in the string are in the Unihan CJKV Unicode blocks, but I figured it was worth asking if anyone knows of an existing solution in Ruby.

asked Jan 13 '11 14:01

Josh Glover

3 Answers

(ruby 1.9.2)

#encoding: UTF-8
class String
  def contains_cjk?
    !!(self =~ /\p{Han}|\p{Katakana}|\p{Hiragana}|\p{Hangul}/)
  end
end

strings = ['日本', '광고 프로그램', '艾弗森将退出篮坛', 'Watashi ha bakana gaijin desu.']
strings.each { |s| puts s.contains_cjk? }

#true
#true
#true
#false

\p{} matches a character’s Unicode script.
The following scripts are supported: Arabic, Armenian, Balinese, Bengali, Bopomofo, Braille, Buginese, Buhid, Canadian_Aboriginal, Carian, Cham, Cherokee, Common, Coptic, Cuneiform, Cypriot, Cyrillic, Deseret, Devanagari, Ethiopic, Georgian, Glagolitic, Gothic, Greek, Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hiragana, Inherited, Kannada, Katakana, Kayah_Li, Kharoshthi, Khmer, Lao, Latin, Lepcha, Limbu, Linear_B, Lycian, Lydian, Malayalam, Mongolian, Myanmar, New_Tai_Lue, Nko, Ogham, Ol_Chiki, Old_Italic, Old_Persian, Oriya, Osmanya, Phags_Pa, Phoenician, Rejang, Runic, Saurashtra, Shavian, Sinhala, Sundanese, Syloti_Nagri, Syriac, Tagalog, Tagbanwa, Tai_Le, Tamil, Telugu, Thaana, Thai, Tibetan, Tifinagh, Ugaritic, Vai, and Yi.

Wow. Ruby Regexp source.
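Since \p{} works per script, the same properties can also report which scripts occur rather than just whether any do. A minimal sketch, assuming Ruby 1.9 or later; the names CJK_SCRIPTS and scripts_in are made up for illustration and are not part of Ruby or of the answer above:

```ruby
# encoding: UTF-8

# Hypothetical lookup table (not from the answer): one regexp per script,
# each built from a \p{...} script property.
CJK_SCRIPTS = {
  :han      => /\p{Han}/,
  :hiragana => /\p{Hiragana}/,
  :katakana => /\p{Katakana}/,
  :hangul   => /\p{Hangul}/
}

# Returns the names of the scripts above that appear at least once in str.
def scripts_in(str)
  CJK_SCRIPTS.select { |_name, pattern| str =~ pattern }.keys
end

p scripts_in('ひらがなとカタカナ')
p scripts_in('광고 프로그램')
```

An empty array then means the string contains none of the four scripts, which is the negation of contains_cjk?.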

answered Nov 05 '22 07:11

steenslag

Given my Ruby 1.8.7 constraint, this is the best I could do:

class String
  CJKV_RANGES = [
      (0xe2ba80..0xe2bbbf),
      (0xe2bfb0..0xe2bfbf),
      (0xe38080..0xe380bf),
      (0xe38180..0xe383bf),
      (0xe38480..0xe386bf),
      (0xe38780..0xe387bf),
      (0xe38880..0xe38bbf),
      (0xe38c80..0xe38fbf),
      (0xe39080..0xe4b6bf),
      (0xe4b780..0xe4b7bf),
      (0xe4b880..0xe9bfbf),
      (0xea8080..0xea98bf),
      (0xeaa080..0xeaaebf),
      (0xeaaf80..0xefbfbf),
  ]

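  # Assumes $KCODE = 'u' (or ruby -Ku) so that each_char on 1.8.7 yields
  # whole UTF-8 characters rather than single bytes.
  def contains_cjkv?
    each_char do |ch|
      # Read the character's UTF-8 bytes as one hex integer and test each range.
      return true if CJKV_RANGES.any? {|range| range.member? ch.unpack('H*').first.hex }
    end
    false
  end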
end


strings = ['日本', '광고 프로그램', '艾弗森将退出篮坛', 'Watashi ha bakana gaijin desu.']
strings.each {|s| puts s.contains_cjkv? }

#true
#true
#true
#false

Pretty hacktacular, but it works. The ranges are UTF-8 byte sequences read as hex integers, and they are broad enough to match a variety of Indic scripts as well, so the method should probably really be called contains_asian?.

Maybe I should gem this up for other poor I18N hackers stuck with Ruby 1.8.
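For comparison, the same check can be written against Unicode code points instead of raw UTF-8 bytes, since unpack('U*') decodes a UTF-8 string into code points. This is only a sketch: the ranges below are a narrower set of blocks picked for illustration, not a translation of the table above, and the names CJK_CODEPOINT_RANGES and cjk_codepoints? are hypothetical.

```ruby
# encoding: UTF-8

# Illustrative block ranges (an assumption, not the answer's table).
CJK_CODEPOINT_RANGES = [
  (0x1100..0x11FF),   # Hangul Jamo
  (0x2E80..0x2FDF),   # CJK Radicals Supplement, Kangxi Radicals
  (0x3040..0x30FF),   # Hiragana, Katakana
  (0x3400..0x4DBF),   # CJK Unified Ideographs Extension A
  (0x4E00..0x9FFF),   # CJK Unified Ideographs
  (0xAC00..0xD7AF),   # Hangul Syllables
  (0xF900..0xFAFF)    # CJK Compatibility Ideographs
]

# True if any code point of str falls inside one of the ranges above.
def cjk_codepoints?(str)
  str.unpack('U*').any? do |cp|
    CJK_CODEPOINT_RANGES.any? { |range| range.include?(cp) }
  end
end

['日本', '광고 프로그램', 'Watashi ha bakana gaijin desu.'].each do |s|
  puts cjk_codepoints?(s)
end
```

The two representations describe the same character: '日' is code point 0x65E5 via unpack('U*'), and 0xe697a5 via unpack('H*').first.hex, which is the form the byte-based ranges compare against.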

answered Nov 05 '22 08:11

Josh Glover

I've written a little gem that packages up the approach in steenslag's answer above:

https://github.com/jpatokal/script_detector

It can also take a stab at differentiating between Japanese, Korean, simplified Chinese and traditional Chinese, although due to the complexities of Han unification it only works reliably with large slabs of text.
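To see why Han unification makes this hard, here is a naive heuristic. It is not the gem's API or its algorithm, just a hypothetical sketch assuming Ruby 1.9 or later: kana implies Japanese and Hangul implies Korean, but a string made only of Han characters cannot be assigned to a language this way.

```ruby
# encoding: UTF-8

# Hypothetical helper (not part of script_detector): guess a language from
# the scripts present. Han-only text is reported as :han_only because the
# same ideographs are shared by Chinese and Japanese (and older Korean).
def guess_cjk_language(str)
  return :japanese if str =~ /\p{Hiragana}|\p{Katakana}/
  return :korean   if str =~ /\p{Hangul}/
  return :han_only if str =~ /\p{Han}/
  nil
end

p guess_cjk_language('日本語の文章')
p guess_cjk_language('日本語')
```

Short Japanese strings written entirely in kanji fall into the :han_only bucket together with Chinese text, which is where a statistical approach over larger amounts of text becomes necessary.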

answered Nov 05 '22 09:11

lambshaanxy
