
Ruby, problems comparing strings with UTF-8 characters

I have these 2 UTF-8 strings:

a = "N\u01b0\u0303"
b = "N\u1eef"

They look quite different in source, but they are identical once rendered:

irb(main):039:0> puts "#{a} - #{b}"
Nữ - Nữ

The a version is the one I have stored in the DB. The b version is the one coming from the browser in a POST request. I don't know why the browser is sending a different combination of UTF-8 characters, and it doesn't always happen: I can't reproduce the issue in my dev environment, only in production, and only for a percentage of the total requests.

The problem is that when I compare the two strings, the result is false:

irb(main):035:0> a == b
=> false

I've tried different things like forcing encoding:

irb(main):022:0> b.force_encoding("UTF-8") == a.force_encoding("UTF-8")
=> false

Another interesting fact is:

irb(main):005:0> a.chars
=> ["N", "ư", "̃"]
irb(main):006:0> b.chars
=> ["N", "ữ"]

How can I compare these kinds of strings?

fguillen asked Nov 24 '15 14:11


2 Answers

This is an issue with Unicode equivalence.

The a version of your string consists of the character ư (U+01B0: LATIN SMALL LETTER U WITH HORN), followed by U+0303 COMBINING TILDE. This second character, as the name suggests, is a combining character: when rendered, it is combined with the previous character to produce the final glyph.

The b version of the string uses the character ữ (U+1EEF: LATIN SMALL LETTER U WITH HORN AND TILDE), which is a single character. It is equivalent to the previous combination, but it is represented by a different byte sequence.
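To make the difference visible, here is a minimal sketch that prints the code points and UTF-8 byte sizes of both strings. The lambda name to_code_points is just an illustrative helper, not part of Ruby.

```ruby
a = "N\u01b0\u0303"  # decomposed: ư followed by COMBINING TILDE
b = "N\u1eef"        # precomposed: ữ as a single code point

# Hypothetical helper: format each code point as U+XXXX.
to_code_points = ->(s) { s.codepoints.map { |cp| format("U+%04X", cp) } }

a_points = to_code_points.call(a)
b_points = to_code_points.call(b)

puts a_points.join(" ")  # U+004E U+01B0 U+0303
puts b_points.join(" ")  # U+004E U+1EEF

# The UTF-8 byte sequences differ as well, which is why == returns false.
puts a.bytes.map { |byte| format("%02X", byte) }.join(" ")
puts b.bytes.map { |byte| format("%02X", byte) }.join(" ")
puts a.bytesize  # 5
puts b.bytesize  # 4
```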

In order to compare these strings you need to normalize them, so that they both use the same byte sequences for these types of characters. Ruby 2.2 and later have this built in as String#unicode_normalize (in earlier versions you needed to use a third-party library).

So currently you have

a == b

which is false, but if you do

a.unicode_normalize == b.unicode_normalize

you should get true.
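If you compare such strings in several places, it may help to wrap the normalization in one method. The following is only a sketch, assuming Ruby 2.2 or later; same_text? is a hypothetical name, and the variable names stand in for your DB value and request parameter.

```ruby
# Hypothetical helper: compare two strings after normalizing both to the same form.
def same_text?(left, right, form: :nfc)
  left.unicode_normalize(form) == right.unicode_normalize(form)
end

stored   = "N\u01b0\u0303"  # the version stored in the DB
incoming = "N\u1eef"        # the version sent by the browser

puts stored == incoming            # false
puts same_text?(stored, incoming)  # true

# unicode_normalized? reports whether a string is already in a given form (NFC by default).
puts stored.unicode_normalized?    # false
puts incoming.unicode_normalized?  # true
```

Another option is to normalize once at the boundaries (before saving and before looking up), so that plain == and database equality checks work on already-normalized data.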

If you are on an older version of Ruby, there are a couple of options. Rails has a normalize method as part of its multibyte support (deprecated in Rails 6.0 in favor of String#unicode_normalize), so if you are using an older Rails you can do:

a.mb_chars.normalize == b.mb_chars.normalize

or perhaps something like:

ActiveSupport::Multibyte::Unicode.normalize(a) == ActiveSupport::Multibyte::Unicode.normalize(b)

If you’re not using Rails, then you could look at the unicode_utils gem, and do something like this:

UnicodeUtils.nfkc(a) == UnicodeUtils.nfkc(b)

(nfkc refers to the normalization form. It matches the default of the Rails methods above, whereas String#unicode_normalize defaults to NFC. Either form makes these two strings compare equal.)

There are several ways to normalize Unicode strings (for example, whether you end up with the decomposed or the composed versions), and these examples just use each method's default. I’ll leave researching the differences to you.
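As a starting point for that research, here is a small sketch of how the forms differ, assuming Ruby 2.2 or later. The ligature example is my own addition to show what the "K" (compatibility) forms change; it is not part of the original question.

```ruby
a = "N\u01b0\u0303"
b = "N\u1eef"

# Hypothetical helper: format each code point as U+XXXX.
show = ->(s) { s.codepoints.map { |cp| format("U+%04X", cp) }.join(" ") }

nfc = a.unicode_normalize(:nfc)  # composed form
nfd = b.unicode_normalize(:nfd)  # fully decomposed form

puts show.call(nfc)  # U+004E U+1EEF
puts show.call(nfd)  # U+004E U+0075 U+031B U+0303

# The compatibility forms (NFKC, NFKD) also fold characters such as ligatures.
ligature = "\ufb01"  # LATIN SMALL LIGATURE FI
nfkc = ligature.unicode_normalize(:nfkc)
puts ligature.unicode_normalize(:nfc) == ligature  # true
puts nfkc                                          # fi
```

Whichever form you pick, the important part is to apply the same one to both sides of the comparison.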

matt answered Oct 16 '22 10:10


These are distinct character sequences. The first string uses ư followed by a modifier, the "combining tilde"; the second uses the single precomposed character ữ.

Wikipedia has a section on this:

Code point sequences that are defined as canonically equivalent are assumed to have the same appearance and meaning when printed or displayed. For example, the code point U+006E (the Latin lowercase "n") followed by U+0303 (the combining tilde "◌̃") is defined by Unicode to be canonically equivalent to the single code point U+00F1 (the lowercase letter "ñ" of the Spanish alphabet). Therefore, those sequences should be displayed in the same manner, should be treated in the same way by applications such as alphabetizing names or searching, and may be substituted for each other.

and

The standard also defines a text normalization procedure, called Unicode normalization, that replaces equivalent sequences of characters so that any two texts that are equivalent will be reduced to the same sequence of code points, called the normalization form or normal form of the original text.

It seems that Ruby supports this normalization, but only as of Ruby 2.2:

http://ruby-doc.org/stdlib-2.2.0/libdoc/unicode_normalize/rdoc/String.html

a = "N\u01b0\u0303".unicode_normalize
b = "N\u1eef".unicode_normalize

a == b  # true
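The ñ example from the Wikipedia quote can be checked the same way. The sketch below also normalizes strings before using them as Hash keys, so that equivalent spellings are treated as the same entry when searching; the index variable and the sample word are only illustrative.

```ruby
decomposed  = "n\u0303"  # U+006E followed by U+0303 COMBINING TILDE
precomposed = "\u00f1"   # U+00F1, the single code point for ñ

puts decomposed == precomposed                    # false
puts decomposed.unicode_normalize == precomposed  # true

# Hypothetical index keyed by NFC-normalized words.
index = Hash.new(0)
["man\u0303ana", "ma\u00f1ana"].each do |word|
  index[word.unicode_normalize(:nfc)] += 1
end

puts index.size  # 1
```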

Alternatively, if you are using Ruby on Rails, there appears to be a built-in method for normalization.

14 revs, 12 users 16% answered Oct 16 '22 08:10