Bytes vs codepoints in ruby

Question

What is the difference between the Ruby String methods codepoints and bytes?

'abcd'.bytes
=> [97, 98, 99, 100]

'abcd'.codepoints
=> [97, 98, 99, 100]

Sergio Tulentsev · Accepted Answer

bytes returns the individual bytes of the string, regardless of character size, whereas codepoints returns its Unicode codepoints.

s = '日本語'
s.bytes # => [230, 151, 165, 230, 156, 172, 232, 170, 158]
s.codepoints # => [26085, 26412, 35486]
s.chars # => ["日", "本", "語"]

I see where your confusion comes from. Ruby now uses UTF-8 encoding by default, and UTF-8 was specifically designed so that its first 128 codepoints (0-127) are encoded exactly as in ASCII. ASCII is an encoding with one-byte characters, so for the ASCII-only string in your question, each codepoint is stored as a single byte with the same value. That is why bytes and codepoints happen to return the same values there.
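To see where the two methods start to diverge, here is a small sketch (the variable names and the sample string are mine, not from the question). It assumes the strings are UTF-8, which is what "\u" escapes produce, and mixes characters that take 1, 2, 3 and 4 bytes:

```ruby
ascii = "abcd"
mixed = "a\u00e9\u65e5\u{1F600}"  # a, é, 日, 😀

# For ASCII-only text, bytes and codepoints coincide.
ascii.ascii_only?                  # => true
ascii.bytes == ascii.codepoints    # => true

# Outside ASCII, one codepoint is encoded as several bytes.
mixed.ascii_only?                  # => false
mixed.codepoints                   # => [97, 233, 26085, 128512]
mixed.length                       # => 4  (characters)
mixed.bytesize                     # => 10 (bytes)
mixed.chars.map(&:bytesize)        # => [1, 2, 3, 4]
```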

So, if you need to break a string into characters, use either chars or codepoints (whichever is appropriate for your use case). Use bytes only when you treat the string as an opaque binary blob, not as text.
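A rough illustration of why byte-level operations are a poor fit for text, assuming a UTF-8 string (the variable names are just for this example): cutting by bytes can split a character in half, while cutting by characters cannot.

```ruby
s = "\u65e5\u672c\u8a9e"       # "日本語", 3 characters, 9 bytes

by_chars = s[0, 2]             # first two characters => "日本"
by_bytes = s.byteslice(0, 2)   # first two bytes: an incomplete character

by_chars.valid_encoding?       # => true
by_bytes.valid_encoding?       # => false

# When the data really is a binary blob, String#b gives a copy tagged
# as binary (ASCII-8BIT), where every "character" is one byte.
blob = s.b
blob.length                    # => 9
```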

Actually, chars (suggested above) might not be accurate enough, since Unicode has the notion of combining characters and modifier letters. If you care about this, you need to use so-called "grapheme clusters". Here's an example (taken from this answer):

s = "a\u0308\u0303\u0323\u032d"
s.bytes # => [97, 204, 136, 204, 131, 204, 163, 204, 173]
s.codepoints # => [97, 776, 771, 803, 813]
s.chars # => ["a", "̈", "̃", "̣", "̭"]
s.grapheme_clusters # => ["ạ̭̈̃"] # rendering of this glyph is kinda broken, which illustrates the point that unicode is hard
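One place where this difference tends to bite is reversing a string. The sketch below is my own example rather than part of the linked answer, and it assumes Ruby 2.5 or later, where String#grapheme_clusters is available. String#reverse works character by character, so a combining mark ends up detached from its base letter, whereas reversing the grapheme clusters keeps them together:

```ruby
word = "cafe\u0301"   # "café" written as e + COMBINING ACUTE ACCENT

word.length                            # => 5 (codepoints)
word.grapheme_clusters.size            # => 4 (user-perceived characters)

# Naive reverse: the accent now leads the string, detached from the e.
word.reverse                           # => "\u0301efac"

# Reversing by grapheme cluster keeps e and its accent together.
word.grapheme_clusters.reverse.join    # => "e\u0301fac"
```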

Tags:

ruby

Vivak kumar
