
What does sorting mean in non-alphabetic (i.e., Asian) languages?

I have some code that sorts table columns by object properties. It occurred to me that in Japanese or Chinese (non-alphabetical languages), the strings sent to the sort function would be compared the same way strings in an alphabetical language are.

Take for example a list of Japanese surnames:

寿拘 (Suzuki) 松坂 (Matsuzaka) 松井 (Matsui) 山田 (Yamada) 藤本 (Fujimoto) 

When I sort the above list via JavaScript, the result is:

寿拘 (Suzuki) 山田 (Yamada) 松井 (Matsui) 松坂 (Matsuzaka) 藤本 (Fujimoto) 

This is different from the ordering of the Japanese syllabary, which would arrange the list phonetically (the way a Japanese dictionary would):

寿拘 (Suzuki) 藤本 (Fujimoto) 松井 (Matsui) 松坂 (Matsuzaka) 山田 (Yamada) 

What I want to know is:

  1. Does one double-byte character really get compared against the other in a sort function?
  2. What really goes on in such a sort?
  3. (Extra credit) Does the result of such a sort mean anything at all? Does the concept of sorting really work in Asian (and other) languages? If so, what does it mean and what should one strive for in creating a compare function for those languages?

ADDENDUM TO SUMMARIZE ANSWERS AND DRAW CONCLUSIONS:

First, thanks to all who contributed to the discussion. This has been very informative and helpful. Special shout-outs to bobince, Lie Ryan, Gumbo, Jeffrey Zheng, and Larry K, for their in-depth and thoughtful analyses. I awarded the check mark to Larry K for pointing me toward a solution my question failed to foresee, but I up-ticked every answer I found useful.

The consensus appears to be that:

  1. Chinese and Japanese character strings are sorted by Unicode code points. That ordering may follow a rationale that is intelligible to knowledgeable readers, but it is unlikely to be of much practical value in helping users find the information they're seeking.

  2. The kind of compare function that would be required to make a sort semantically or phonetically useful is far too cumbersome to consider pursuing, especially since the results would probably be less than satisfactory, and in any case the comparison algorithms would have to be changed for each language. Best just to allow the sort to proceed without even attempting a compare function.

  3. I was probably asking the wrong question here. That is, I was thinking too much "inside the box" without considering that the real question is not how do I make sorting useful in these languages, but how do I provide the user with a useful way of finding items in a list. Westerners automatically think of sorting for this purpose, and I was guilty of that. Larry K pointed me to a Wikipedia article that suggests a filtering function might be more useful for Asian readers. This is what I plan to pursue, as it's at least as fast as sorting, client-side. I will keep the column sorting because it's well understood in Western languages, and because speakers of any language would find the sorting of dates and other numerical-based data types useful. But I will also add that filtering mechanism, which would be useful in long lists for any language.

Robusto asked Sep 21 '10 20:09



2 Answers

Does one double-byte character really get compared against the other in a sort function?

The native String type in JavaScript is based on UTF-16 code units, and that's what gets compared. For characters in the Basic Multilingual Plane (which all these are), this is the same as Unicode code points.
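As a rough illustration (a sketch added for clarity, not part of the original answer), the default `sort()` with no compare function orders strings by their UTF-16 code units. The helper name `firstCodeUnitHex` is made up for this example.

```javascript
// The surnames from the question, in their original order.
const surnames = ['寿拘', '松坂', '松井', '山田', '藤本'];

// Hypothetical helper: hex value of a string's first UTF-16 code unit.
function firstCodeUnitHex(s) {
  return s.charCodeAt(0).toString(16).toUpperCase();
}

// With no compare function, sort() compares strings code unit by code unit.
const byCodeUnit = [...surnames].sort();

// Print each name next to the code unit that decided its position.
for (const name of byCodeUnit) {
  console.log(name, 'U+' + firstCodeUnitHex(name));
}
```

The printed code units increase down the list, which reproduces the order reported in the question. The two names starting with 松 tie on the first character, so the comparison falls through to the second one.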

The term ‘double-byte’ as in encodings like Shift-JIS has no meaning in a web context: DOM and JavaScript strings are natively Unicode, and the original bytes in the encoded page received by the browser are long gone.

Does the result of such a sort mean anything at all?

Little. Unicode code points do not claim to offer any particular ordering... for one, because there is no globally accepted ordering. Even for the most basic case of ASCII Latin characters, languages disagree (e.g. on whether v and w are the same letter, or whether the uppercase of i is I or İ). And CJK gets much gnarlier than that.

The main Unicode CJK Unified Ideographs block happens to be ordered by radical and number of strokes (Kangxi dictionary order), which may be vaguely useful. But use characters from any of the other CJK extension blocks, or mix in some kana, or romaji, and there will be no meaningful ordering between them.
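To make the mixed-script point concrete, here is a small sketch (the sample spellings are chosen for illustration and do not come from the original answer). The same surname written in kanji, hiragana, katakana, and romaji ends up grouped by Unicode block rather than by reading.

```javascript
// One surname written four ways: kanji, hiragana, katakana, romaji.
const spellings = ['山田', 'やまだ', 'ヤマダ', 'Yamada'];

// Default sort compares code units, so each script lands in its own band:
// Latin (U+0059) < hiragana (U+3084) < katakana (U+30E4) < kanji (U+5C71).
const sortedSpellings = [...spellings].sort();

console.log(sortedSpellings);
// → [ 'Yamada', 'やまだ', 'ヤマダ', '山田' ]
```

All four strings would be read identically, yet a plain sort keeps them apart, so a list that mixes scripts gets no useful order from it.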

The Unicode Consortium do attempt to define some general ordering rules, but they are complex and not generally implemented at the language level. Systems that really need language-sensitive sorting abilities (e.g. OSes, databases) tend to have their own collation schemes.
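Newer JavaScript runtimes do expose locale-aware comparison through the ECMAScript Internationalization API, which postdates this answer. The following is only a sketch: the function name `sortWithLocale` and the `'ja'` locale tag are assumptions, and the resulting order depends on the collation data bundled with the runtime, so no particular output is claimed here.

```javascript
// Sketch: locale-aware comparison via Intl.Collator.
// The order produced depends on the runtime's locale data; if the requested
// locale is unavailable, the runtime falls back to another one.
function sortWithLocale(items, locale) {
  const collator = new Intl.Collator(locale);
  // collator.compare is already bound, so it can be passed directly.
  return [...items].sort(collator.compare);
}

console.log(sortWithLocale(['寿拘', '松坂', '松井', '山田', '藤本'], 'ja'));
```

Even where this is available, it cannot know how a given kanji name is pronounced, so it should not be expected to reproduce a phonetic dictionary order for names.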

This is different from the ordering of the Japanese syllabary

Yes. Above and beyond collation issues in general, it's a massively difficult task to handle kanji accurately by syllable, because you have to guess at the pronunciation. JavaScript can't realistically know that by ‘藤本’ you mean ‘Fujimoto’ and not ‘touhon’; this sort of thing requires in-depth built-in dictionaries and still-unreliable heuristics... not the sort of thing you want to build into a programming language.
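One way around the guessing problem, sketched here under the assumption that each record can carry a human-supplied kana reading, is to sort on that reading instead of on the kanji. The `reading` field and the `sortByReading` function are hypothetical names for this example, not something the answer prescribes.

```javascript
// Each record carries a hiragana reading entered by a person, not guessed by code.
// `reading` is a hypothetical field name used only in this sketch.
const people = [
  { name: '寿拘', reading: 'すずき' },
  { name: '松坂', reading: 'まつざか' },
  { name: '松井', reading: 'まつい' },
  { name: '山田', reading: 'やまだ' },
  { name: '藤本', reading: 'ふじもと' },
];

// Compare readings by code unit. Hiragana code points roughly follow
// syllabary order, which is enough for this small list.
function sortByReading(records) {
  return [...records].sort((a, b) =>
    a.reading < b.reading ? -1 : a.reading > b.reading ? 1 : 0);
}

console.log(sortByReading(people).map((p) => p.name));
// → [ '寿拘', '藤本', '松井', '松坂', '山田' ]
```

This reproduces the phonetic order given in the question. Plain code-unit comparison of kana is only an approximation of dictionary order (voiced marks, small kana, and mixed hiragana and katakana need extra handling), so treat it as a starting point.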

bobince answered Sep 22 '22 20:09


You could implement the Unicode Collation Algorithm in JavaScript if you want something better than the default JS sort for strings; it might improve some things. Though, as the Unicode documentation states:

Collation is not uniform; it varies according to language and culture: Germans, French and Swedes sort the same characters differently. It may also vary by specific application: even within the same language, dictionaries may sort differently than phonebooks or book indices. For non-alphabetic scripts such as East Asian ideographs, collation can be either phonetic or based on the appearance of the character.

The Wikipedia article points out that since collation is so tough in non-alphabetic scripts, nowadays the answer is to make it very easy to look up information by entering characters, rather than by looking through a list.
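A lookup of that kind can be very small on the client side. The following is a minimal sketch; `filterByQuery` is a hypothetical helper name, and a real page would wire it to a text input.

```javascript
// Hypothetical helper: keep only the entries that contain what the user typed.
function filterByQuery(items, query) {
  if (query === '') {
    return [...items]; // an empty query shows the whole list
  }
  return items.filter((item) => item.includes(query));
}

console.log(filterByQuery(['寿拘', '松坂', '松井', '山田', '藤本'], '松'));
// → [ '松坂', '松井' ]
```

Because it only tests for a substring, this works the same way for any script and needs no collation rules at all.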

I suggest that you talk to truly knowledgeable end users of your application to see how they would best like it to behave. The problem of ordering Chinese characters is not unique to your application.

Also, if you don't want to implement the collation in your system, another solution would be for you to create an Ajax service that stores the names in a MySQL or other database, then looks up the data with an ORDER BY clause.

Larry K answered Sep 19 '22 20:09