Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Delphi XE AnsiStrings with escaped combining diacritical marks

What is the best way to convert a Delphi XE AnsiString containing escaped combining diacritical marks like "Fu\u0308rst" into a frienly WideString "Fürst"?

I am aware of the fact that this is not always possible for all combinations, but the common Latin blocks should be supported without building silly conversion tables on my own. I guess the solution can be found somewhere in the new Characters unit, but I don't get it.

like image 656
Erwin Jurschitza Avatar asked Nov 18 '10 12:11

Erwin Jurschitza


3 Answers

I think you need to perform Unicode Normalization. on your string.

I don't know if there's a specific call in Delphi XE RTL to do this, but the WinAPI call NormalizeString should help you here, with mode NormalizationKC:

NormalizationKC

Unicode normalization form KC, compatibility composition. Transforms each base plus combining characters to the canonical precomposed equivalent and all compatibility characters to their equivalents. For example, the ligature fi becomes f + i; similarly, A + ¨ + fi + n becomes Ä + f + i + n.

like image 174
Roddy Avatar answered Oct 24 '22 23:10

Roddy


Here is the complete code that solved my problem:

function Unescape(const s: AnsiString): string;
var
  i: Integer;
  j: Integer;
  c: Integer;
begin
  // Make result at least large enough. This prevents too many reallocs
  SetLength(Result, Length(s));
  i := 1;
  j := 1;
  while i <= Length(s) do begin
    if s[i] = '\' then begin
      if i < Length(s) then begin
        // escaped backslash?
        if s[i + 1] = '\' then begin
          Result[j] := '\';
          inc(i, 2);
        end
        // convert hex number to WideChar
        else if (s[i + 1] = 'u') and (i + 1 + 4 <= Length(s)) 
                and TryStrToInt('$' + string(Copy(s, i + 2, 4)), c) then begin
          inc(i, 6);
          Result[j] := WideChar(c);
        end else begin
          raise Exception.CreateFmt('Invalid code at position %d', [i]);
        end;
      end else begin
        raise Exception.Create('Unexpected end of string');
      end;
    end else begin
      Result[j] := WideChar(s[i]);
      inc(i);
    end;
    inc(j);
  end;

  // Trim result in case we reserved too much space
  SetLength(Result, j - 1);
end;

const
  NormalizationC = 1;

function NormalizeString(NormForm: Integer; lpSrcString: LPCWSTR; cwSrcLength: Integer;
 lpDstString: LPWSTR; cwDstLength: Integer): Integer; stdcall; external 'Normaliz.dll';

function Normalize(const s: string): string;
var
  newLength: integer;
begin
  // in NormalizationC mode the result string won't grow longer than the input string
  SetLength(Result, Length(s));
  newLength := NormalizeString(NormalizationC, PChar(s), Length(s), PChar(Result), Length(Result));
  SetLength(Result, newLength);
end;

function UnescapeAndNormalize(const s: AnsiString): string;
begin
  Result := Normalize(Unescape(s));
end;

Thank you all! I am sure that my first experience with StackOverflow won't be my last one :-)

like image 22
Erwin Jurschitza Avatar answered Oct 24 '22 23:10

Erwin Jurschitza


Are they always escaped like this? Always in a number of 4 digits?

How is the \ character itself escaped?

Assuming the \character is escaped by \xxxx where xxxx is the code for the \ character, you can easily loop through the string:

function Unescape(s: AnsiString): WideString;
var
  i: Integer;
  j: Integer;
  c: Integer;
begin
  // Make result at least large enough. This prevents too many reallocs
  SetLength(Result, Length(s));
  i := 1; j := 1;
  while i <= Length(s) do
  begin
     // If a '\' is found, typecast the following 4 digit integer to widechar
     if s[i] = '\' then
     begin
       if (s[i+1] <> 'u') or not TryStrToInt(Copy(s, i+2, 4), c) then
         raise Exception.CreateFmt('Invalid code at position %d', [i]);

       Inc(i, 6);
       Result[j] := WideChar(c);
     end
     else
     begin
       Result[j] := WideChar(s[i]);
       Inc(i);
     end;
     Inc(j);
  end;

  // Trim result in case we reserved too much space
  SetLength(Result, j-1);
end;

Use like this

  MessageBoxW(0, PWideChar(Unescape('\u0252berhaupt')), nil, MB_OK);

This code is tested in Delphi 2007, but should work in XE as well due to the explicit use of Ansistring and Widestring.

[edit] Code is ok. Highlighter fails.

like image 1
GolezTrol Avatar answered Oct 24 '22 21:10

GolezTrol