Given a string in Ruby 1.8.7 (without the awesome Oniguruma regular expression engine that supports Unicode properties with \p{}), I would like to be able to determine if the string contains one or more Chinese, Japanese, or Korean characters; i.e.
class String
def contains_cjk?
...
end
end
>> '日本語'.contains_cjk?
=> true
>> '광고 프로그램'.contains_cjk?
=> true
>> '艾弗森将退出篮坛'.contains_cjk?
=> true
>> 'Watashi ha bakana gaijin desu.'.contains_cjk?
=> false
I suspect that this will boil down to seeing if any of the characters in the string are in the Unihan CJKV Unicode blocks, but I figured it was worth asking if anyone knows of an existing solution in Ruby.
To check if a given String contains only unicode letters, digits or space, we use the isLetterOrDigit() and charAt() methods with decision making statements. The isLetterOrDigit(char ch) method determines whether the specific character (Unicode ch) is either a letter or a digit.
To test if a particular Regex matches (part of) a string, use the =∼ operator. This operator returns the character position in the string of the start of the match (which evaluates to true in a boolean test), or nil if no match was found (which evaluates to false).
One can access the characters in a string in Ruby simply by specifying the character number(index) within square brackets.
(ruby 1.9.2)
#encoding: UTF-8
class String
def contains_cjk?
!!(self =~ /\p{Han}|\p{Katakana}|\p{Hiragana}|\p{Hangul}/)
end
end
strings= ['日本', '광고 프로그램', '艾弗森将退出篮坛', 'Watashi ha bakana gaijin desu.']
strings.each{|s| puts s.contains_cjk?}
#true
#true
#true
#false
\p{} matches a character’s Unicode script.
The following scripts are supported: Arabic, Armenian, Balinese, Bengali, Bopomofo, Braille, Buginese, Buhid, Canadian_Aboriginal, Carian, Cham, Cherokee, Common, Coptic, Cuneiform, Cypriot, Cyrillic, Deseret, Devanagari, Ethiopic, Georgian, Glagolitic, Gothic, Greek, Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hiragana, Inherited, Kannada, Katakana, Kayah_Li, Kharoshthi, Khmer, Lao, Latin, Lepcha, Limbu, Linear_B, Lycian, Lydian, Malayalam, Mongolian, Myanmar, New_Tai_Lue, Nko, Ogham, Ol_Chiki, Old_Italic, Old_Persian, Oriya, Osmanya, Phags_Pa, Phoenician, Rejang, Runic, Saurashtra, Shavian, Sinhala, Sundanese, Syloti_Nagri, Syriac, Tagalog, Tagbanwa, Tai_Le, Tamil, Telugu, Thaana, Thai, Tibetan, Tifinagh, Ugaritic, Vai, and Yi.
Wow. Ruby Regexp source .
Given my Ruby 1.8.7 constraint, this is the best I could do:
class String
CJKV_RANGES = [
(0xe2ba80..0xe2bbbf),
(0xe2bfb0..0xe2bfbf),
(0xe38080..0xe380bf),
(0xe38180..0xe383bf),
(0xe38480..0xe386bf),
(0xe38780..0xe387bf),
(0xe38880..0xe38bbf),
(0xe38c80..0xe38fbf),
(0xe39080..0xe4b6bf),
(0xe4b780..0xe4b7bf),
(0xe4b880..0xe9bfbf),
(0xea8080..0xea98bf),
(0xeaa080..0xeaaebf),
(0xeaaf80..0xefbfbf),
]
def contains_cjkv?
each_char do |ch|
return true if CJKV_RANGES.any? {|range| range.member? ch.unpack('H*').first.hex }
end
false
end
end
strings = ['日本', '광고 프로그램', '艾弗森将退出篮坛', 'Watashi ha bakana gaijin desu.']
strings.each {|s| puts s.contains_cjkv? }
#true
#true
#true
#false
Pretty hacktacular, but it works. It actually detects a variety of Indic scripts as well, so it should probably really be called contains_asian?
Maybe I should gem this up for other poor I18N hackers stuck with Ruby 1.8.
I've written a little gem that packages up the approach in steenslag's answer above:
https://github.com/jpatokal/script_detector
It can also take a stab at differentiating between Japanese, Korean, simplified Chinese and traditional Chinese, although due to the complexities of Han unification it only works reliably with large slabs of text.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With