Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How do I compare unicode strings containing non-english characters to sort alpabetically?

I am trying to sort array/lists/whatever of data based upon the unicode string values in them which contain non-english characters, I want them sorted correctly alphabetically.

I have written a lot of code (D2010, win XP), which I thought was pretty solid for future internationalisation, but it is not. Its all using unicodestring (string) data type, which up until now I have just been putting english characters into the unicode strings.

It seems I have to own up to making a very serious unicode mistake. I talked to my German friend, and tried out some German ß's, (ß is 'ss' and should come after S and before T in alphabet) and and ö's etc (note the umlaut) and none of my sorting algorithms work anymore. Results are very mixed up. Garbage.

Since then I have been reading up extensively and learnt a lot of unpleasant things with regards to unicode collation. Things are looking grim, much grimmer than I ever expected, I have seriously messed this up. I hope I am missing something and things are not actually quite as grim as they appear at present. I have been tinkering around looking at windows api calls (RtlCompareUnicodeString) with no success (protection faults), I could not get it to work. Problem with API calls I learnt is that they change on various newer windows platforms, and also with delphi going cross plat soon, with linux later, my app is client server so I need to be concerned about this, but tbh with the situation being what is it (bad) I would be grateful for any forward progress, ie win api specific.

Is using win api function RtlCompareUnicodeString to obvious solution? If so I should really try again with that but tbh I have been taken aback by all of the issues involved with unicode collation and I not clear at all what I should be doing to compare these strings this way anyway.

I learnt of the IBM ICU c++ opensource project, there is a delphi wrapper for it albeit for an older version of ICU. It seems a very comprehensive solution which is platform independant. Surely I cannot be looking at creating a delphi wrapper for this (or updating the existing one) to get a good solution for unicode collation?

I would be extremely glad to hear advice at two levels :-

A) A windows specific non portable solution, I would be glad off that at the moment, forget the client server ramifications! B) A more portable solution which is immune from the various XP/vista/win7 variations of unicode api functions, therefore putting me in good stead for XE2 mac support and future linux support, not to mention the client server complications.

Btw I dont really want to be doing 'make-do' solutions, scanning strings prior to comparison and replacing certain tricky characters etc, which I have read about. I gave the German examplle above, thats just an example, I want to get it working for all (or at least most, far east, russian) languages, I don't want to do workarounds for a specific language or two. I also do not need any advice on the sorting algorithms, they are fine, its just the string comparison bit that's wrong.

I hope I am missing/doing something stupid, this all looks to be a headache.

Thank you.


EDIT, Rudy, here is how I was trying to call RtlCompareUnicodeString. Sorry for the delay I have been having a horrible time with this.

program Project26

{$APPTYPE CONSOLE}

uses
  SysUtils;


var
  a,b:ansistring;

  k,l:string;
  x,y:widestring;
  r:integer;

procedure RtlInitUnicodeString(
  DestinationString:pstring;
  SourceString:pwidechar) stdcall; external 'NTDLL';

function RtlCompareUnicodeString(
  String1:pstring;
  String2:pstring;
  CaseInSensitive:boolean
  ):integer stdcall; external 'NTDLL';


begin

  x:='wef';
  y:='fsd';

  RtlInitUnicodeString(@k, pwidechar(x));
  RtlInitUnicodeString(@l, pwidechar(y));

  r:=RtlCompareUnicodeString(@k,@l,false);

  writeln(r);
  readln;

end.

I realise this is most likely wrong, I am not used to calling api unctions directly, this is my best guess.

About your StringCompareEx api function. That looked really good, but is avail on Vista + only, I'm using XP. StringCompare is on XP, but that's not Unicode!

To recap, the basic task afoot, is to compare two strings, and to do so based on the character sort order specified in the current windows locale.

Can anyone say for sure if ansicomparetext should do this or not? It don't work for me, but others have said it should, and other things i have read suggest it should.

This is what I get with 31 test strings when using AnsiCompareText when in German Locale (space delimited - no strings contain spaces) :-

  • arß Asß asß aßs no nö ö ön oo öö oöo öoö öp pö ss SS ßaß ßbß sß Sßa Sßb ßß ssss SSSS ßßß ssßß SSßß ßz ßzß z zzz

EDIT 2.

I am still keen to hear if I should expect AnsiCompareText to work using the locale info, as lkessler has said so, and lkessler has also posted about these subjects before and seems have been through this before.

However, following on from Rudy's advice I have also been checking out CompareStringW - which shares the same documentation with CompareString, so it is NOT non-unicode as I have stated earlier.

Even if AnsiCompareText is not going to work, although I think it should, the win32api function CompareStringW should indeed work. Now I have defined my API function, and I can call it, and I get a result, and no error... but i get the same result everytime regardless of the input strings! It returns 1 everytime - which means less than. Here's my code

var
  k,l:string;

function CompareStringW(
  Locale:integer;
  dwCmpFlags:longword;
  lpString1:pstring;
  cchCount1:integer;
  lpString2:pstring;
  cchCount2:integer
  ):integer stdcall; external 'Kernel32.dll';

begin;

  k:='zzz';
  l:='xxx';

  writeln(length(k));
  r:=comparestringw(LOCALE_USER_DEFAULT,0,@k,3,@l,3);

  writeln(r); // result is 1=less than, 2=equal, 3=greater than
  readln;

end;

I feel I am getting somewhere now after much pain. Would be glad to know about AnsiCompareText, and what I am doing wrong with the above CompareStringW api call. Thank you.


EDIT 3

Firstly, I fixed the api call to CompareStringW myself, I was passing in @mystring when I should do PString(mystring). Now it all works correctly.

r:=comparestringw(LOCALE_USER_DEFAULT,0,pstring(k),-1,pstring(l),-1);

Now, you can imagine my dismay when I still got the same sort result as I did right at the beginning...

  • arß asß aßs Asß no nö ö ön oo öö oöo öoö öp pö ss SS ßaß ßbß sß Sßa Sßb ßß ssss SSSS ßßß ssßß SSßß ßz ßzß z zzz

You may also imagine my EXTREME dismay not to mention simultaneous joy when I realised the sort order IS CORRECT, and IT WAS CORRECT RIGHT BACK IN THE BEGGINING! It make sme sick to say it, but there was never any problem in the first place - this is all down to my lack of German knowledge. I beleived the sort was wrong, since you can see above string start with S, then later they start with ß, then s again and back to ß and so on. Well I can't speak German however I could still clearly see that they was not sorted correctly - my German friend told me ß comes after S and before T... I WAS WRONG! What is happening is that string functions (both AnsiCompareText and winapi CompareTextW) are SUBSTITUTING every 'ß' with 'ss', and every 'ö' with a normal 'o'... so if i take those result above and to a search and replace as described I get...

  • arss asss asss Asss no no o on oo oo ooo ooo op po ss SS ssass ssbss sss Sssa Sssb ssss ssss SSSS ssssss ssssss SSssss ssz sszss z zzz

Looks pretty correct to me! And it always was.

I am extremely grateful for all the advice given, and extremely sorry to have wasted your time like this. Those german ß's got me all confused, there was never nothing wrong with the built in delphi function or anything else. It just looked like there was. I made the mistake of combining them with normal 's' in my test data, any other letter would have not have created this illusion of un-sortedness! The squiggly ß's have made me look a fool! ßs!

Rudy and lkessler we're both especially helpful, ty both, I have to accept lkessler's answer as most correct, sorry Rudy.

like image 580
csharpdefector Avatar asked Aug 13 '11 15:08

csharpdefector


People also ask

How does Python compare strings alphabetically?

To compare strings alphabetically in Python, you can use the < (less than), > (greater than), <= (less than or equal to), and >= (greater than or equal to) operators. Note that uppercase letters come before lowercase letters.

Does sort work on strings?

The sorted() function returns a sorted list of the specified iterable object. You can specify ascending or descending order. Strings are sorted alphabetically, and numbers are sorted numerically. Note: You cannot sort a list that contains BOTH string values AND numeric values.


1 Answers

You said you had problems calling Windows API calls yourself. Could you post the code, so people here can see why it failed? It is not as hard as it may seem, but it does require some care. ISTM that RtlCompareUnicodeStrings() is too low level.

I found a few solutions:

Non-portable

You could use the Windows API function CompareStringEx. This will compare using Unicode specific collation types. You can specify how you want this done (see link). It does require wide strings, i.e. PWideChar pointers to them. If you have problems calling it, give a holler and I'll try to add some demo code.

More or less portable

To make this more or less portable, you could write a function that compares two strings and use conditional defines to choose the different comparison APIs for the platform.

like image 100
Rudy Velthuis Avatar answered Oct 03 '22 16:10

Rudy Velthuis