What is the difference between the Ruby string methods codepoints and bytes?
'abcd'.bytes
=> [97, 98, 99, 100]
'abcd'.codepoints
=> [97, 98, 99, 100]
bytes returns the individual bytes, regardless of char size, whereas codepoints returns Unicode codepoints.
s = '日本語'
s.bytes # => [230, 151, 165, 230, 156, 172, 232, 170, 158]
s.codepoints # => [26085, 26412, 35486]
s.chars # => ["日", "本", "語"]
I see where your confusion comes from. Ruby now uses UTF-8 encoding by default, and UTF-8 was specifically designed so that its first 128 codepoints (0-127) are encoded exactly as in ASCII. ASCII is an encoding with one-byte chars, so for the examples in your question the methods bytes and codepoints return the same values, coincidentally.
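To make the ASCII overlap concrete, here is a small sketch. The helper name bytes_match_codepoints? is made up for illustration (it is not part of Ruby), and the snippet assumes the strings are UTF-8 encoded, which is the default for source files in modern Ruby:

```ruby
# Hypothetical helper (not a built-in): true when a string's byte values
# and codepoint values are identical, which for UTF-8 means ASCII-only text.
def bytes_match_codepoints?(str)
  str.bytes == str.codepoints
end

ascii = 'abcd'
mixed = 'abc日'

puts ascii.ascii_only?               # true
puts bytes_match_codepoints?(ascii)  # true

puts mixed.ascii_only?               # false
puts bytes_match_codepoints?(mixed)  # false
puts mixed.bytes.length              # 6 (3 one-byte chars + 1 three-byte char)
puts mixed.codepoints.length         # 4
```

As soon as a single non-ASCII character appears, the two arrays diverge in both length and content.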
So, if you need to break a string into characters, use either chars or codepoints (whichever is appropriate for your use case). Use bytes only when you treat the string as an opaque binary blob, not text.
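As a rough illustration of the text-versus-blob distinction, the sketch below compares the character count with the byte count and shows what happens when the same data is viewed as binary. The helper name text_and_byte_sizes is hypothetical; String#b (which returns a copy tagged as binary, ASCII-8BIT) is assumed to be available, which it is in Ruby 2.0 and later:

```ruby
# Hypothetical helper: reports how large a string is as text and as raw bytes.
def text_and_byte_sizes(str)
  { chars: str.length, bytes: str.bytesize }
end

s = '日本語'
p text_and_byte_sizes(s)   # {chars: 3, bytes: 9} (older Rubies print {:chars=>3, :bytes=>9})

# Viewing the same data as an opaque blob: every byte becomes one "character".
blob = s.b
puts blob.encoding == Encoding::BINARY  # true
puts blob.length                        # 9
puts blob.bytes == s.bytes              # true, the underlying bytes are unchanged
```

The bytes never change; only the interpretation does, which is why bytes fits I/O, hashing, and similar binary work rather than text processing.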
Actually, chars (suggested above) might not be accurate enough, since Unicode has the notion of combining characters and modifier letters. If you care about this, you need to use so-called "grapheme clusters". Here's an example (taken from this answer):
s = "a\u0308\u0303\u0323\u032d"
s.bytes # => [97, 204, 136, 204, 131, 204, 163, 204, 173]
s.codepoints # => [97, 776, 771, 803, 813]
s.chars # => ["a", "̈", "̃", "̣", "̭"]
s.grapheme_clusters # => ["ạ̭̈̃"] # rendering of this glyph is kinda broken, which illustrates the point that unicode is hard
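If what you want is the number of characters a reader would perceive, counting grapheme clusters is one way to get it. This is only a sketch: visible_length is a made-up helper name, and String#grapheme_clusters requires Ruby 2.5 or later:

```ruby
# Hypothetical helper: counts user-perceived characters (grapheme clusters)
# instead of codepoints. Requires Ruby 2.5+ for String#grapheme_clusters.
def visible_length(str)
  str.grapheme_clusters.length
end

s = "a\u0308\u0303\u0323\u032d"  # "a" followed by four combining marks
puts s.length            # 5 (codepoints)
puts visible_length(s)   # 1 (one perceived character)

e = "e\u0301"            # "e" followed by a combining acute accent
puts e.length            # 2
puts visible_length(e)   # 1
```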