I need to write a program that scans strings of various lengths and selects only those written using symbols from a set that I define (in particular, Japanese characters). The strings will contain words written in different languages (German, French, Arabic, Russian, English, etc.), so there is obviously a huge number of possible characters. I do not know which data structure to use for this. I am using Delphi 7 right now. Can anybody suggest how to write such a program?
In Delphi 2009 and later, the default string type is encoded as UTF-16: most graphic characters occupy 1 Char (2 bytes), while others are represented by 2 Chars (4 bytes), a so-called surrogate pair. In Delphi 7 the default string is an ANSI string, and WideString is the UTF-16 type. There is no such thing as a "normal graphic character" in Unicode.
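To make the surrogate-pair point concrete, here is a minimal sketch for Delphi 7 that counts code points in a WideString rather than Chars. The names IsHighSurrogate, IsLowSurrogate and CodePointCount are hypothetical helpers written for this illustration, not RTL functions.

```pascal
// Hypothetical helpers for UTF-16 code units held in a WideString.
function IsHighSurrogate(C: WideChar): Boolean;
begin
  Result := (Ord(C) >= $D800) and (Ord(C) <= $DBFF);
end;

function IsLowSurrogate(C: WideChar): Boolean;
begin
  Result := (Ord(C) >= $DC00) and (Ord(C) <= $DFFF);
end;

// Counts code points, not code units: a well-formed surrogate pair counts once.
// An unpaired surrogate is counted as a single (invalid) unit.
function CodePointCount(const S: WideString): Integer;
var
  n: Integer;
begin
  Result := 0;
  n := 1;
  while n <= Length(S) do
  begin
    if IsHighSurrogate(S[n]) and (n < Length(S)) and IsLowSurrogate(S[n + 1]) then
      Inc(n, 2)   // skip both halves of the pair
    else
      Inc(n);
    Inc(Result);
  end;
end;
```

For text that stays inside the Basic Multilingual Plane, CodePointCount(S) equals Length(S); the two differ only when surrogate pairs are present.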
Unicode is a standard encoding system that is used to represent characters from almost all languages. Every Unicode character is identified by a unique integer code point between 0 and 0x10FFFF. A Unicode string is a sequence of zero or more code points.
The Delphi language supports short-string types - in effect, subtypes of ShortString - whose maximum length is anywhere from 0 to 255 characters.
A string represents a sequence of characters. Delphi supports several predefined string types, which differ in their maximum length.
Obviously you would be better off with Delphi 2010, since the VCL in Delphi 7 is not aware of Unicode strings. You can still use the WideString and WideChar types in Delphi 7, and you can install a component set like the TNT Unicode Components to help you create a user interface that can display your results.
For a very large set type, consider using a bit array like TBits. A bit array of length 65536 is enough to hold every UTF-16 code unit, that is, every code point in the Basic Multilingual Plane; characters outside it are stored as surrogate pairs and cannot be tested with a single lookup. Checking whether Char X is in Set Y would basically be:
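function WideCharsInSet(const wcstr: WideString; wcset: TBits): Boolean;
var
  n, code: Integer;
begin
  // True if at least one character of wcstr is a member of wcset.
  Result := False;
  for n := 1 to Length(wcstr) do begin
    code := Ord(wcstr[n]);
    // TBits raises an exception when read at or beyond Size, so check the bound first.
    if (code < wcset.Size) and wcset[code] then begin
      Result := True;
      Break; // one match is enough
    end;
  end;
end;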
procedure Demo;
var
  wcset1: TBits;
  s: WideString;
begin
  wcset1 := TBits.Create;
  try
    // U+1157 - Hangul (Korean) code point I found with Char Map.
    // Char Map shows hexadecimal values, hence $1157 rather than decimal 1157.
    wcset1[$1157] := True;
    // go get a string value s:
    s := WideChar($1157);
    // WideCharsInSet returns True if at least one element of wcset1 is found in s:
    if WideCharsInSet(s, wcset1) then begin
      Application.MessageBox('Found it', 'found it', MB_OK);
    end;
  finally
    wcset1.Free;
  end;
end;
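The question asks for strings made up only of characters from a chosen set, whereas the function above reports whether at least one character matches. A minimal sketch of the stricter test follows. It assumes, purely for illustration, that "Japanese" means Hiragana (U+3040 to U+309F), Katakana (U+30A0 to U+30FF) and the CJK Unified Ideographs block (U+4E00 to U+9FFF); the names AddRange, BuildJapaneseSet and AllWideCharsInSet are hypothetical.

```pascal
uses
  Classes;  // TBits

// Hypothetical helper: mark every code unit from First to Last as a member.
procedure AddRange(CharSet: TBits; First, Last: Integer);
var
  i: Integer;
begin
  for i := First to Last do
    CharSet[i] := True;
end;

// Hypothetical helper: an assumed set of Japanese characters within the BMP.
// The caller owns the returned TBits and must free it.
function BuildJapaneseSet: TBits;
begin
  Result := TBits.Create;
  Result.Size := 65536;            // one bit per UTF-16 code unit
  AddRange(Result, $3040, $309F);  // Hiragana
  AddRange(Result, $30A0, $30FF);  // Katakana
  AddRange(Result, $4E00, $9FFF);  // CJK Unified Ideographs (kanji)
end;

// True only if S is non-empty and every code unit of S is in CharSet.
function AllWideCharsInSet(const S: WideString; CharSet: TBits): Boolean;
var
  n, Code: Integer;
begin
  Result := Length(S) > 0;
  for n := 1 to Length(S) do
  begin
    Code := Ord(S[n]);
    // Reading TBits at or beyond Size raises an exception, so check the bound first.
    if (Code >= CharSet.Size) or not CharSet[Code] then
    begin
      Result := False;
      Break;
    end;
  end;
end;
```

With these ranges, a string containing spaces, punctuation, or rare kanji stored as surrogate pairs is rejected, so the ranges would need extending (for example with U+3000 to U+303F for CJK punctuation) if such strings should pass.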
For the simple processing of strings in the manner you describe, do not be put off by suggestions that you should upgrade to the latest compiler and a Unicode-enabled framework. The Unicode support itself is provided by the underlying Windows API, which is directly accessible from "non-Unicode" versions of Delphi just as much as from "Unicode" versions.
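As one illustration of calling the wide ("W") Windows API directly from Delphi 7, the sketch below displays a WideString with MessageBoxW from the Windows unit, rather than going through the ANSI-based Application.MessageBox. The procedure name ShowWideMessage is hypothetical.

```pascal
uses
  Windows;  // MessageBoxW, MB_OK

// Hypothetical helper: show a WideString without converting it to ANSI first.
procedure ShowWideMessage(const Text, Caption: WideString);
begin
  MessageBoxW(0, PWideChar(Text), PWideChar(Caption), MB_OK);
end;
```

Whether the characters actually render still depends on the fonts installed on the machine.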
I suspect that most if not all of the Unicode support that you need for the purposes outlined in your question can be obtained from the Unicode support provided in the JEDI JCL.
For any visual component support you may require, the TNT control set has the appeal of being free.