
Same string but different byte codes

Tags:

ruby

I have two strings:

a = 'hà nội'
b = 'hà nội'

When I compare them with a == b, it returns false.

I checked their bytes:

a.bytes # => [104, 97, 204, 128, 32, 110, 195, 180, 204, 163, 105]
b.bytes # => [104, 195, 160, 32, 110, 225, 187, 153, 105]

What is the cause? How can I fix it so that a == b returns true?

Toàn asked Jan 27 '18 03:01




1 Answer

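This is an issue with Unicode equivalence. String a stores some of its accents as separate combining marks (for example, bytes 204, 128 are U+0300 COMBINING GRAVE ACCENT following a plain "a"), while b uses single precomposed characters (for example, bytes 195, 160 are U+00E0, "à"). The two strings render identically but contain different byte sequences, so == returns false.

To compare these strings, normalize both to the same form so that equivalent characters are represented by the same byte sequences: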

a.unicode_normalize == b.unicode_normalize
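
As a minimal sketch, the two strings can be rebuilt from the byte values listed in the question, so the result does not depend on how the literals were copied from the page. The variable names raw_equal and normalized_equal are only illustrative.

```ruby
# Rebuild both strings from the byte values given in the question.
a = [104, 97, 204, 128, 32, 110, 195, 180, 204, 163, 105]
      .pack('C*').force_encoding('UTF-8')
b = [104, 195, 160, 32, 110, 225, 187, 153, 105]
      .pack('C*').force_encoding('UTF-8')

# Plain comparison looks at the underlying bytes, which differ.
raw_equal = (a == b)                                        # => false

# After normalizing both sides (default form is :nfc), they match.
normalized_equal = (a.unicode_normalize == b.unicode_normalize) # => true

# The strings also differ in character count before normalization,
# because a contains separate combining marks.
a.length                # => 8
b.length                # => 6

# unicode_normalized? reports whether a string is already in the given form.
a.unicode_normalized?   # => false
b.unicode_normalized?   # => true
```

If the strings come from user input or external files, it may be simpler to normalize them once when they enter the program rather than at every comparison.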

unicode_normalize(form=:nfc)

Returns a normalized form of str, using Unicode normalizations NFC, NFD, NFKC, or NFKD. The normalization form used is determined by form, which is any of the four values :nfc, :nfd, :nfkc, or :nfkd. The default is :nfc.

If the string is not in a Unicode Encoding, then an Exception is raised. In this context, 'Unicode Encoding' means any of UTF-8, UTF-16BE/LE, and UTF-32BE/LE, as well as GB18030, UCS_2BE, and UCS_4BE. Anything other than UTF-8 is implemented by converting to UTF-8, which makes it slower than UTF-8.
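
The following sketch illustrates the form argument and the error case described above. The ligature and ISO-8859-1 strings are examples chosen here for illustration and do not come from the question; the variable names are likewise hypothetical.

```ruby
# Same two strings as in the question, rebuilt from their bytes.
a = [104, 97, 204, 128, 32, 110, 195, 180, 204, 163, 105]
      .pack('C*').force_encoding('UTF-8')
b = [104, 195, 160, 32, 110, 225, 187, 153, 105]
      .pack('C*').force_encoding('UTF-8')

# :nfd decomposes every accented letter into base letter + combining marks.
nfd_a = a.unicode_normalize(:nfd)
nfd_b = b.unicode_normalize(:nfd)
nfd_a == nfd_b   # => true
nfd_b.length     # => 9 (h, a, U+0300, space, n, o, U+0323, U+0302, i)

# The "K" forms additionally fold compatibility characters,
# such as the "fi" ligature U+FB01, into their plain equivalents.
ligature      = "\uFB01"
ligature_nfc  = ligature.unicode_normalize(:nfc)   # still the ligature
ligature_nfkc = ligature.unicode_normalize(:nfkc)  # => "fi"

# A string in a non-Unicode encoding cannot be normalized directly.
latin1 = "\u00E9".encode('ISO-8859-1')
error_class = begin
  latin1.unicode_normalize
  nil
rescue Encoding::CompatibilityError => e
  e.class
end

# Converting to UTF-8 first avoids the error.
latin1_normalized = latin1.encode('UTF-8').unicode_normalize
```

For plain equality checks, :nfc or :nfd both work as long as the same form is applied to both sides; the :nfkc and :nfkd forms change more characters and are usually reserved for search or identifier matching.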

fongfan999 answered Oct 19 '22 11:10