What is the best way to convert a Delphi XE AnsiString containing escaped combining diacritical marks like "Fu\u0308rst" into a friendly WideString "Fürst"?
I am aware that this is not possible for every combination, but the common Latin blocks should be supported without building silly conversion tables on my own. I guess the solution can be found somewhere in the new Character unit, but I don't get it.
I think you need to perform Unicode normalization on your string.
I don't know if there's a specific call in Delphi XE RTL to do this, but the WinAPI call NormalizeString should help you here, with mode NormalizationKC:
NormalizationKC
Unicode normalization form KC, compatibility composition. Transforms each base plus combining characters to the canonical precomposed equivalent and all compatibility characters to their equivalents. For example, the ligature fi becomes f + i; similarly, A + ¨ + fi + n becomes Ä + f + i + n.
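To see how the two composing forms differ in practice, here is a minimal console sketch. The helper name NormalizeTo is hypothetical, and the constant values 1 and 5 are assumed from the NORM_FORM enumeration in the Windows SDK. It first asks the API for an estimated buffer length by passing a nil destination, then normalizes for real. The documentation describes that estimate as approximate, so production code may want to retry when the second call reports an insufficient buffer.

```pascal
program NormalizeDemo;

{$APPTYPE CONSOLE}

uses
  SysUtils;

const
  // Assumed values of the NORM_FORM enumeration (winnls.h)
  NormalizationC  = 1;  // canonical composition
  NormalizationKC = 5;  // compatibility composition

function NormalizeString(NormForm: Integer; lpSrcString: PWideChar;
  cwSrcLength: Integer; lpDstString: PWideChar;
  cwDstLength: Integer): Integer; stdcall; external 'Normaliz.dll';

// Hypothetical helper, not part of the RTL: normalizes s to the given form.
function NormalizeTo(NormForm: Integer; const s: UnicodeString): UnicodeString;
var
  Needed: Integer;
begin
  Result := '';
  if s = '' then
    Exit;
  // A nil destination with length 0 returns an estimated buffer length
  Needed := NormalizeString(NormForm, PWideChar(s), Length(s), nil, 0);
  if Needed <= 0 then
    RaiseLastOSError;
  SetLength(Result, Needed);
  // Second call performs the conversion and returns the actual length
  Needed := NormalizeString(NormForm, PWideChar(s), Length(s),
    PWideChar(Result), Length(Result));
  if Needed <= 0 then
    RaiseLastOSError;
  // Trim to the number of characters actually written
  SetLength(Result, Needed);
end;

var
  s: UnicodeString;
begin
  // 'u' followed by a combining diaeresis, then the "fi" ligature (U+FB01)
  s := 'Fu'#$0308'rst' + #$FB01;
  Writeln(Length(s));                                // 7
  Writeln(Length(NormalizeTo(NormalizationC, s)));   // 6: composes the umlaut, keeps the ligature
  Writeln(Length(NormalizeTo(NormalizationKC, s)));  // 7: also expands the ligature to f + i
end.
```

If ligatures and other compatibility characters should be left untouched, NormalizationC is probably the better fit. NormalizationKC changes more of the text than just the combining marks.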
Here is the complete code that solved my problem:
function Unescape(const s: AnsiString): string;
var
  i: Integer;
  j: Integer;
  c: Integer;
begin
  // Make result at least large enough. This prevents too many reallocs
  SetLength(Result, Length(s));
  i := 1; j := 1;
  while i <= Length(s) do
  begin
    if s[i] = '\' then
    begin
      if i < Length(s) then
      begin
        // escaped backslash?
        if s[i + 1] = '\' then
        begin
          Result[j] := '\';
          inc(i, 2);
        end
        // convert hex number to WideChar
        else if (s[i + 1] = 'u') and (i + 1 + 4 <= Length(s))
          and TryStrToInt('$' + string(Copy(s, i + 2, 4)), c) then
        begin
          inc(i, 6);
          Result[j] := WideChar(c);
        end
        else
        begin
          raise Exception.CreateFmt('Invalid code at position %d', [i]);
        end;
      end
      else
      begin
        raise Exception.Create('Unexpected end of string');
      end;
    end
    else
    begin
      Result[j] := WideChar(s[i]);
      inc(i);
    end;
    inc(j);
  end;
  // Trim result in case we reserved too much space
  SetLength(Result, j - 1);
end;

const
  NormalizationC = 1;

function NormalizeString(NormForm: Integer; lpSrcString: LPCWSTR;
  cwSrcLength: Integer; lpDstString: LPWSTR;
  cwDstLength: Integer): Integer; stdcall; external 'Normaliz.dll';

function Normalize(const s: string): string;
var
  newLength: integer;
begin
  // in NormalizationC mode the result string won't grow longer than the input string
  SetLength(Result, Length(s));
  newLength := NormalizeString(NormalizationC, PChar(s), Length(s), PChar(Result), Length(Result));
  SetLength(Result, newLength);
end;

function UnescapeAndNormalize(const s: AnsiString): string;
begin
  Result := Normalize(Unescape(s));
end;
Thank you all! I am sure that my first experience with StackOverflow won't be my last one :-)
Are they always escaped like this? Always in a number of 4 digits?
How is the \ character itself escaped?
Assuming the \ character itself is escaped as \uxxxx, where xxxx is the code for the \ character, you can simply loop through the string:
function Unescape(s: AnsiString): WideString;
var
  i: Integer;
  j: Integer;
  c: Integer;
begin
  // Make result at least large enough. This prevents too many reallocs
  SetLength(Result, Length(s));
  i := 1; j := 1;
  while i <= Length(s) do
  begin
    // If a '\' is found, typecast the following 4 digit integer to widechar.
    // Note: TryStrToInt reads the digits as decimal here; prefix the string
    // with '$' to read them as hexadecimal instead.
    if s[i] = '\' then
    begin
      if (s[i+1] <> 'u') or not TryStrToInt(Copy(s, i+2, 4), c) then
        raise Exception.CreateFmt('Invalid code at position %d', [i]);
      Inc(i, 6);
      Result[j] := WideChar(c);
    end
    else
    begin
      Result[j] := WideChar(s[i]);
      Inc(i);
    end;
    Inc(j);
  end;
  // Trim result in case we reserved too much space
  SetLength(Result, j-1);
end;
Use it like this:
MessageBoxW(0, PWideChar(Unescape('\u0252berhaupt')), nil, MB_OK);
This code is tested in Delphi 2007, but should work in XE as well due to the explicit use of Ansistring and Widestring.
[edit] Code is ok. Highlighter fails.