Cyrillic strings Я̆ Я̄ Я̈ return length 2 instead of 1 in Ruby and other programming languages

Question

In Ruby, JavaScript and Java (I didn't try others), the Cyrillic characters Я̆ Я̄ Я̈ have length 2. When I check the length of a string containing these characters, I get the wrong value.

"Я̈".mb_chars.length
#=> 2  #should be 1 (ruby on rails)

"Я̆".length
#=> 2  #should be 1 (ruby, javascript)

"Ӭ".length
#=> 1  #correct (ruby, javascript)

Please note that the strings are encoded in UTF-8 and each of these characters behaves as a single character.

My question is: why does this happen, and how can I get the correct length of a string containing these characters?

mu is too short · Accepted Answer

The underlying problem is that Я̈ is actually two code points: the Я and the umlaut (a combining diaeresis) are separate:

'Я̈'.chars
#=> ["Я", "̈"]
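To make the two code points visible, here is a minimal sketch that prints them in U+XXXX form. The string is written with `\u` escapes so nothing depends on how the editor stores the combining mark; the variable names are only for illustration.

```ruby
# Я (U+042F) followed by COMBINING DIAERESIS (U+0308),
# spelled out with escapes so both code points are explicit.
ya_diaeresis = "\u042F\u0308"

# List each code point in the usual U+XXXX notation.
code_points = ya_diaeresis.codepoints.map { |cp| format('U+%04X', cp) }

p code_points          # ["U+042F", "U+0308"]
p ya_diaeresis.length  # 2, because String#length counts code points
```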

Normally you'd solve this sort of problem through Unicode normalization, but that alone won't help you here because there is no single precomposed code point for Я̈ or Я̆ (although there is one for Ӭ).
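As a rough illustration of that difference, the sketch below runs NFC normalization (via `String#unicode_normalize`, part of core Ruby since 2.2) on two decomposed strings. Э plus a combining diaeresis composes into the single code point U+04EC (Ӭ), while Я plus the same mark stays as two code points. The variable names are hypothetical.

```ruby
# Both strings are a base letter followed by COMBINING DIAERESIS (U+0308).
e_diaeresis  = "\u042D\u0308"  # Э + diaeresis: a precomposed form exists (U+04EC)
ya_diaeresis = "\u042F\u0308"  # Я + diaeresis: no precomposed form exists

# NFC composes base + mark into one code point wherever Unicode defines one.
composed_e  = e_diaeresis.unicode_normalize(:nfc)
composed_ya = ya_diaeresis.unicode_normalize(:nfc)

p composed_e == "\u04EC"  # true
p composed_e.length       # 1
p composed_ya.length      # 2, normalization has nothing to compose into
```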

You could strip off the diacritics before checking the length:

'Я̆'.gsub(/\p{Diacritic}/, '')
#=> "Я" 
'Я̆'.gsub(/\p{Diacritic}/, '').length
#=> 1

You'll get the desired length but the strings won't be quite the same. This also works on things like Ӭ which can be represented by a single code point:

'Ӭ'.length
#=> 1
'Ӭ'.gsub(/\p{Diacritic}/, '')
#=> "Ӭ" 
'Ӭ'.gsub(/\p{Diacritic}/, '').length
#=> 1

Unicode is wonderful and awesome and solves many problems that used to plague us. Unfortunately, Unicode is also horrible and complicated because human languages and glyphs weren't exactly designed.

Stefan · Answer

Ruby 2.5 adds String#each_grapheme_cluster:

'Я̆Я̄Я̈'.each_grapheme_cluster.to_a   #=> ["Я̆", "Я̄", "Я̈"]
'Я̆Я̄Я̈'.each_grapheme_cluster.count  #=> 3

Note that you can't use each_grapheme_cluster.size here: it is equivalent to each_char.size, so both would return 6 in the above example. (That looks like a bug; I've just filed a bug report.)
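Putting this together, a small helper along these lines could report the user-perceived length. This is a sketch that assumes Ruby 2.5 or later; `grapheme_length` is a made-up name, not a built-in method. The regex `\X` (extended grapheme cluster) is shown as an alternative, though how well it segments text on older Ruby versions is not something this example verifies.

```ruby
# Hypothetical helper: count user-perceived characters (grapheme clusters)
# rather than code points. Uses #count, which walks the enumerator.
def grapheme_length(str)
  str.each_grapheme_cluster.count
end

# Я + breve, Я + macron, Я + diaeresis: six code points in total.
text = "\u042F\u0306\u042F\u0304\u042F\u0308"

p text.length             # 6 code points
p grapheme_length(text)   # 3 grapheme clusters
p text.scan(/\X/).length  # 3, using the regex grapheme-cluster escape
```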

Tags: string, ruby, ruby-on-rails, utf-8, unicode-normalization

tomkra
