
How do I check equality of Unicode strings in JavaScript?

I have two strings in JavaScript: "_strange_chars_µö¬é@zendesk.com.eml" (f1) and "_strange_chars_µö¬é@zendesk.com.eml" (f2). At first glance they look identical (and on StackOverflow they may well be; I'm not sure what happens to them when they are pasted into a form like this). In my application, however:

f1[16] // ö
f2[16] // o
f1[17] // ¬
f2[17] // ̈

That is, where f1 uses the ö character, f2 uses an o and a diacritic ¨ as a separate character. What comparison can I do that will show these two strings to be "equal"?

James A. Rosen asked Aug 17 '11 18:08



1 Answer

f1 uses the ö character, f2 uses an o and a diacritic ¨ as a separate character.

f1 is in Normalization Form C (NFC, composed) and f2 is in Normalization Form D (NFD, decomposed). In general NFC is the most common form on Windows and the web, with the Unicode FAQ describing it as “the best form for general text”. Unfortunately the Apple world plumped for NFD in order to be gratuitously different.

The strings are canonically equivalent by the rules of Unicode equivalence.
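
To see the difference concretely, the code points of each form can be listed. This is only a small sketch; the helper name codePoints is illustrative and does not come from the question:

```javascript
// Hypothetical helper: list a string's code points as U+XXXX labels.
function codePoints(str) {
  // Array.from iterates by code point, not by UTF-16 code unit.
  return Array.from(str, (ch) =>
    'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0'));
}

const composed = '\u00F6';    // ö as a single precomposed code point (NFC)
const decomposed = 'o\u0308'; // o followed by a combining diaeresis (NFD)

console.log(codePoints(composed));    // U+00F6
console.log(codePoints(decomposed));  // U+006F, U+0308
console.log(composed === decomposed); // false: === compares code units, not equivalence
```

Both strings render as the same glyph, which is why they look identical on screen while a plain comparison reports them as different.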

What comparison can I do that will show these two strings to be "equal"?

In general, you convert both strings to one Normal Form of your choosing and then compare them. For example in Python:

>>> import unicodedata
>>> a= u'\u00F6'  # ö composed
>>> b= u'o\u0308' # o then combining umlaut
>>> unicodedata.normalize('NFC', a)==unicodedata.normalize('NFC', b)
True

Similarly Java has the Normalizer class, .NET has String.Normalize, and many languages have bindings available to the ICU library, which also offers this feature.

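When this question was asked, JavaScript had no native Unicode normalisation ability. Since ECMAScript 2015, strings have a built-in normalize() method that accepts 'NFC', 'NFD', 'NFKC' or 'NFKD', so you can normalise both strings and then compare them with ===. In an environment that lacks it, the options are either:

  • doing it yourself, carting around large Unicode data tables to cover it all in JavaScript (standalone implementations of this exist); or

  • sending it back to the server-side (eg via XMLHttpRequest), where you've got a better-equipped language to do it.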
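
A minimal sketch of the normalise-then-compare approach in JavaScript, assuming an engine that provides String.prototype.normalize() (ECMAScript 2015 or later). The helper name canonicallyEqual is hypothetical, and the decomposed é in f2 is an assumption; the question only shows the decomposed ö:

```javascript
// Hypothetical helper: treat two strings as equal when they are
// canonically equivalent, by normalising both to the same form first.
function canonicallyEqual(a, b) {
  return a.normalize('NFC') === b.normalize('NFC');
}

// f1 uses precomposed characters; f2 uses base letters plus combining marks.
// (That f2's é is also decomposed is assumed for illustration.)
const f1 = '_strange_chars_\u00B5\u00F6\u00AC\u00E9@zendesk.com.eml';
const f2 = '_strange_chars_\u00B5o\u0308\u00ACe\u0301@zendesk.com.eml';

console.log(f1 === f2);                // false
console.log(canonicallyEqual(f1, f2)); // true
```

Which form you normalise to does not matter for the comparison, as long as both sides use the same one; NFC is chosen here only because it is the form the answer describes as most common.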

bobince answered Sep 23 '22 00:09