
Bytes vs codepoints in ruby

Tags:

ruby

What is the difference between the Ruby String methods codepoints and bytes?

'abcd'.bytes
=> [97, 98, 99, 100]

'abcd'.codepoints
=> [97, 98, 99, 100]
Vivak kumar asked Nov 28 '16 16:11


1 Answer

bytes returns the individual bytes of the string, regardless of how many bytes each character occupies, whereas codepoints returns its Unicode code points.

s = '日本語'
s.bytes # => [230, 151, 165, 230, 156, 172, 232, 170, 158]
s.codepoints # => [26085, 26412, 35486]
s.chars # => ["日", "本", "語"]

I see where your confusion comes from. Ruby uses UTF-8 by default now, and UTF-8 was specifically designed so that its first 128 code points (0-127) are encoded exactly as in ASCII. ASCII is an encoding with one-byte characters, so for the ASCII-only string in your question, bytes and codepoints happen to return the same values.
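To make the ASCII overlap concrete, here is a minimal sketch (the variable names are only illustrative) that compares the two methods on an ASCII-only string and on a string with one non-ASCII character. It assumes the strings are UTF-8, which the `\u` escape guarantees for the second one.

```ruby
# "\u00e9" is é, written as an escape so the string is UTF-8
# no matter how the source file itself is encoded.
ascii = "abcd"
mixed = "abc\u00e9"

# ASCII-only: one byte per code point, so both views agree.
ascii_match = ascii.bytes == ascii.codepoints

# Non-ASCII: é is a single code point (233) but two UTF-8 bytes (195, 169).
mixed_bytes = mixed.bytes
mixed_codepoints = mixed.codepoints

puts ascii_match        # true
p mixed_bytes           # [97, 98, 99, 195, 169]
p mixed_codepoints      # [97, 98, 99, 233]
puts mixed.length       # 4 (characters)
puts mixed.bytesize     # 5 (bytes)
puts ascii.ascii_only?  # true
```

As soon as a string contains anything outside 0-127, the two arrays stop matching, and length and bytesize diverge as well.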

So, if you need to break a string into characters, use either chars or codepoints (whichever is appropriate for your use case). Use bytes only when you treat the string as an opaque binary blob, not as text.
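A rough sketch of the text view versus the blob view, reusing the Japanese string from above (written with `\u` escapes here). `String#b` and `Array#pack` are standard Ruby; the variable names are only for illustration.

```ruby
s = "\u65e5\u672c\u8a9e"  # "日本語"

# Text view: three characters.
text_length = s.length

# Blob view: nine bytes. String#b returns a copy tagged as binary
# (ASCII-8BIT), so its length is counted in bytes.
blob = s.b
blob_length = blob.length

# Bytes round-trip: pack the integers back into a string and
# relabel the result as UTF-8 to get the original text again.
restored = s.bytes.pack("C*").force_encoding(Encoding::UTF_8)

puts text_length          # 3
puts blob_length          # 9
puts blob.encoding        # ASCII-8BIT
puts restored == s        # true
```

The bytes are the same in both views; only the encoding label decides whether Ruby counts characters or bytes.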


Actually, chars (suggested above) might not be accurate enough, since Unicode has the notion of combining characters and modifier letters. If you care about this, you need to use so-called "grapheme clusters". Here's an example (taken from this answer):

s = "a\u0308\u0303\u0323\u032d"
s.bytes # => [97, 204, 136, 204, 131, 204, 163, 204, 173]
s.codepoints # => [97, 776, 771, 803, 813]
s.chars # => ["a", "̈", "̃", "̣", "̭"]
s.grapheme_clusters # => ["ạ̭̈̃"] # rendering of this glyph is kinda broken, which illustrates the point that unicode is hard
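One place where the difference shows up in practice is reversing a string. The sketch below assumes Ruby 2.5 or later (where String#grapheme_clusters is available) and uses an illustrative word containing "e" followed by a combining acute accent (U+0301).

```ruby
# "noël"-style example: the accent is a separate combining code point.
word = "noe\u0301l"

char_count = word.chars.length                  # counts the accent separately
cluster_count = word.grapheme_clusters.length   # counts e + accent as one unit

# String#reverse works on code points, so the accent ends up
# attached to the wrong letter.
naive = word.reverse

# Reversing grapheme clusters keeps each accent with its base letter.
careful = word.grapheme_clusters.reverse.join

puts char_count      # 5
puts cluster_count   # 4
p naive.codepoints   # [108, 769, 101, 111, 110]
p careful.codepoints # [108, 101, 769, 111, 110]
```

In the naive result the accent follows "l" instead of "e", which is the kind of breakage that grapheme clusters avoid.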
Sergio Tulentsev answered Oct 08 '22 19:10