
Length of a unicode string

In my Rails (2.3, Ruby 1.8.7) application, I need to truncate a string to a certain length. The string is Unicode, and when I ran tests in the console, such as 'א'.length, I noticed that the length returned is doubled. I would like an encoding-agnostic length, so that the same truncation is applied to a Unicode string and to a Latin-1 encoded string.

I've gone over most of the Unicode material for Ruby, but I am still a little in the dark. How should this problem be tackled?

shmichael asked Aug 30 '10 23:08



2 Answers

Rails adds an mb_chars method to String. It returns a multibyte-aware proxy, so methods such as slice and length work on characters rather than bytes. Try unicode_string.mb_chars.slice(0,50)

Teoulas answered Oct 13 '22 23:10


"ア".size # 3 in 1.8, 1 in 1.9
puts "ア".scan(/./mu).size # 1 in both 1.8 and 1.9 (the u flag makes the regex match whole UTF-8 characters)
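
The same scan trick can be turned into a truncation, which is what the question asks for. Below is a minimal sketch, assuming the input is UTF-8; `truncate_chars` is a hypothetical helper name, not something Ruby or Rails provides, and a Latin-1 string would need to be converted to UTF-8 first.

```ruby
# encoding: utf-8

# Hypothetical helper (not part of Ruby or Rails): keep at most max_chars
# characters of a UTF-8 string without cutting a multibyte character in half.
def truncate_chars(str, max_chars)
  # /./mu: "u" makes the regex read the string as UTF-8 and "m" lets "."
  # match newlines too, so scan yields one array element per character.
  str.scan(/./mu).first(max_chars).join
end

puts truncate_chars("אבגד", 2) # prints the first two characters
```

Because it splits on characters instead of bytes, the result should not depend on whether String#size counts bytes (1.8) or characters (1.9).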
Lri answered Oct 13 '22 23:10