I have two strings:
a = 'hà nội'
b = 'hà nội'
When I compare them with a == b
, it returns false
.
I checked the byte codes:
a.bytes = [104, 97, 204, 128, 32, 110, 195, 180, 204, 163, 105]
b.bytes = [104, 195, 160, 32, 110, 225, 187, 153, 105]
What is the cause? How can I fix it so that a == b
returns true
?
A byte string is a fixed-length array of bytes. A byte is an exact integer between 0 and 255 inclusive. A byte string can be mutable or immutable. When an immutable byte string is provided to a procedure like bytes-set!, the exn:fail:contract exception is raised.
Since bytes is the binary data while String is character data. It is important to know the original encoding of the text from which the byte array has created. When we use a different character encoding, we do not get the original string back.
This is an issue with Unicode equivalence.
In order to compare these strings you need to normalize them, so that they both use the same byte sequences for these types of characters.
a.unicode_normalize == b.unicode_normalize
unicode_normalize(form=:nfc)
[link]
Returns a normalized form of str, using Unicode normalizations NFC, NFD, NFKC, or NFKD. The normalization form used is determined by form, which is any of the four values :nfc, :nfd, :nfkc, or :nfkd. The default is :nfc.
If the string is not in a Unicode Encoding, then an Exception is raised. In this context, 'Unicode Encoding' means any of UTF-8, UTF-16BE/LE, and UTF-32BE/LE, as well as GB18030, UCS_2BE, and UCS_4BE. Anything else than UTF-8 is implemented by converting to UTF-8, which makes it slower than UTF-8.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With