
Unicode Characters that can be used to trick a string sorter?

Since Unicode lacks a series of zero-width sorting characters, I need to determine equivalent characters that will allow me to force a certain order on a list that is automatically sorted by character values. Unfortunately, the order I need is not alphabetical, and prefixing the items with visible characters to make the sort produce the wanted outcome is not acceptable.

What Unicode characters can be thrown in front of regular Latin alphabet text, and will not appear, but still allow me to "spike" the sort in the way I require?

(BTW this is being done with Drupal 5 with a user profile list field. Don't bother suggesting changing that to a vocabulary/category.)

Chris Charabaruk asked Sep 30 '08 05:09


2 Answers

Zero-width space (U+200B) should probably do what you want. From the Unicode spec:

Zero Width Space. The U+200B ZERO WIDTH SPACE indicates a line break opportunity, except that it has no width. Zero-width space characters are intended to be used in languages that have no visible word spacing to represent line break opportunities, such as Thai, Khmer, and Japanese.

It should be in most fonts you run into, but YMMV.
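
As a rough illustration, here is a minimal Python sketch of how a U+200B prefix behaves under a plain code-point sort. That sort order is an assumption: the actual sorter (Drupal, or the database collation behind it) may differ, and locale-aware collations often treat U+200B as ignorable, in which case the prefix has no effect. The helper name spike is hypothetical. Note that U+200B has a higher code point than any Latin letter, so in this sketch an item with more prefixes sorts later, not earlier.

```python
ZWSP = "\u200b"  # U+200B ZERO WIDTH SPACE

def spike(items):
    """Prefix the i-th item with i zero-width spaces (hypothetical helper)."""
    return [ZWSP * i + item for i, item in enumerate(items)]

wanted = ["Zebra", "Mango", "Apple"]   # desired, non-alphabetical order
spiked = spike(wanted)

# Python compares strings by code point, so fewer prefixes sort first.
resorted = sorted(spiked)

# Strip the invisible prefix to see which item landed where.
restored = [s.lstrip(ZWSP) for s in resorted]
print(restored)
```

Under these assumptions the sorted list comes back in the wanted order even though the visible text is not alphabetical.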

Joe Hildebrand answered Oct 04 '22 01:10


Personally, I prefer to use a primary/secondary sort key. It's less kludgy, and easy to implement in a typical SQL query (ORDER BY column_a, column_b). Edited to add: In PHP, you could use usort(array, comparisonFunction) with a custom comparison function to add extra sorting logic if you can't use SQL to do the trick.
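
A minimal sketch of the primary/secondary key idea, written in Python for illustration rather than PHP or SQL. The data and the field layout (an explicit rank followed by the label) are hypothetical; the rank plays the role of column_a and the label the role of column_b.

```python
# Hypothetical rows: (sort_rank, label). The rank is never displayed.
rows = [
    (2, "Apple"),
    (1, "Zebra"),
    (2, "Aardvark"),
    (3, "Mango"),
]

# Sort by rank first, then by label to break ties,
# the same effect as ORDER BY column_a, column_b.
ordered = sorted(rows, key=lambda row: (row[0], row[1]))

# Only the label reaches the user.
labels = [label for _rank, label in ordered]
print(labels)
```

The stored text stays clean because the ordering lives in its own field, so nothing has to be stripped before display.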

However, if you only have one column to work with and that's unfixable, just prefix the items with a certain number of unlikely characters, like underscores, for sorting, then strip them just before you display them (using regexp substitution or similar).
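
A minimal sketch of the prefix-and-strip approach, again in Python. It assumes the sorter is case-insensitive, which is simulated here with str.casefold; under a raw code-point sort the underscore falls between the uppercase and lowercase letters, so the result could differ. The helper names are hypothetical, and any item that legitimately starts with an underscore would lose it when stripped.

```python
import re

def add_sort_prefix(items):
    """Give earlier items more underscores (hypothetical helper)."""
    n = len(items)
    return ["_" * (n - i) + item for i, item in enumerate(items)]

def strip_sort_prefix(stored):
    """Remove the leading underscores just before display."""
    return re.sub(r"^_+", "", stored)

wanted = ["Zebra", "mango", "Apple"]   # desired, non-alphabetical order
stored = add_sort_prefix(wanted)       # what goes into the single column

# Assumed sorter: case-insensitive, where "_" sorts before letters.
sorted_stored = sorted(stored, key=str.casefold)

displayed = [strip_sort_prefix(s) for s in sorted_stored]
print(displayed)
```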

Unicode-based hacks will depend heavily on what fonts are used and on which locale's collation/sorting order you're using, and they may produce undesirable side effects on clients you don't control (different browsers, different OSes, different client locales). Most "unprintable" characters yield the "unknown character" glyph when displayed on systems without support for them, which usually looks like an empty square. There are some zero-width characters used for languages like Arabic, but they shouldn't affect sorting except in applications with very perverse Unicode support.

JasonTrue answered Oct 04 '22 00:10