
How to get a single Arabic letter in a string with its Unicode transformation value in Delphi?

Tags:

delphi

Consider the Arabic word جبل, which is made of three letters.

The first letter is جـ, and its name is ǧīm. Its Unicode value is U+FE9F when it is at the beginning of a word, its basic value is U+062C, and its isolated value is U+FE9D, although the last two values draw the same shape, ج.

Whenever I try to get it as a single character, and I have tried many different ways, Delphi returns the basic Unicode value. That makes sense, but what happens to the character with its transformation? It is a single character too. It looks like it takes the transformed value only when it is within a string, but where? How can I extract it? When are these values decided, and by which process? Again, the main question: how can I get the Arabic letter, or its Unicode value, as it is within a string?

Just for information: unlike English, which has two cases for its letters (capital and small), Arabic has four cases (isolated, beginning, middle and end), each with different rules.

Hasan asked May 15 '13 03:05




1 Answer

I'm not sure I understand the question. If you want to know how to write U+FE9F in Delphi source code, in a modern Unicode version of Delphi, simply do it like so:

Char($FE9F)

If you want to read individual characters from جبل then do it like this:

const
  MyWord = 'جبل';
var
  c: Char;
....
c := MyWord[1]; // this is U+062C

Note that the code above is fine for your particular word because each code point can be encoded with a single UTF-16 WideChar character element. If the code point required multiple elements, then it would be best to transform to UTF-32 for code point level processing.
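As a minimal sketch of that code point level processing, the console program below converts the word to UTF-32 with UnicodeStringToUCS4String and prints each code point. The program name and variable names are illustrative, and the word is written with character escapes so that the result does not depend on how the source file is encoded. It assumes a Delphi version that uses scoped unit names; older Unicode versions would use SysUtils instead of System.SysUtils.

```delphi
program CodePoints;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

const
  // The word from the question, written as escapes: jeem, beh, lam.
  MyWord = #$062C#$0628#$0644;

var
  U: UCS4String;
  I: Integer;
begin
  // UCS4String is a zero-based dynamic array holding one element per
  // code point, followed by a terminating zero element.
  U := UnicodeStringToUCS4String(MyWord);

  // Stop at Length(U) - 2 so that the terminator is skipped.
  for I := 0 to Length(U) - 2 do
    Writeln('U+' + IntToHex(Integer(U[I]), 4));
  // prints U+062C, U+0628, U+0644, one per line
end.
```

For this word the UTF-32 view has the same three values as the UTF-16 view. The two only differ for code points outside the Basic Multilingual Plane, which UTF-16 stores as surrogate pairs.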


Now, let's look at the string that you included in the question. I downloaded this question using wget and the file that came down the wire was UTF-8 encoded. I used Notepad++ to convert it to UTF-16LE and then picked out the three UTF-16 characters of your string. They are:

U+062C
U+0628
U+0644

You stated:

The first letter is جـ, name is (ǧīm), its Unicode value is U+FE9F.

But that is simply incorrect. As can be seen from the above, the actual character you posted was U+062C. So the reason why your attempts to read the first character yield U+062C is that U+062C really is the first character of your string.


The bottom line is that nothing in your Delphi code is transforming your character. When you do:

S[1] := Char($FE9F);

the compiler performs a simple two-byte copy. No context-aware transformation occurs, and the same is true when reading S[1].
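A small console program can confirm this. It is only a sketch: the program name is illustrative, and it assumes a desktop compiler with one-based string indexing.

```delphi
program DirectAssign;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

var
  S: string;
begin
  // Start with the word from the question: jeem, beh, lam.
  S := #$062C#$0628#$0644;

  // Overwrite the first UTF-16 element with the initial-form code point.
  S[1] := Char($FE9F);

  // Reading back returns exactly what was stored.
  Writeln(IntToHex(Ord(S[1]), 4)); // FE9F
  Writeln(IntToHex(Ord(S[2]), 4)); // 0628, the neighbour is untouched
end.
```

The value stored is the value read back, and the neighbouring elements do not change, so any difference in appearance has to come from the code that draws the text.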


Let's look at how these characters are displayed, using this simple code in a VCL forms application that contains a memo control:

Memo1.Clear;
Memo1.Lines.Add(StringOfChar(Char($FE9F), 2));
Memo1.Lines.Add(StringOfChar(Char($062C), 2));

The output looks like this:

[Image: screenshot of the memo output]

As you can see, the rendering layer knows what to do with a U+062C character that appears at the beginning of the string.
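If you really do need the presentation-form code point that corresponds to a letter's position, you have to compute it yourself, because the string never contains it. The following is a rough sketch rather than a real shaping algorithm. IsolatedForm and PresentationForm are hypothetical helpers written for this example, not RTL functions. The table covers only the three letters of جبل, all of which join on both sides, and it relies on the Arabic Presentation Forms-B block listing each of these letters in the order isolated, final, initial, medial. Letters that join on one side only, ligatures and diacritics are not handled.

```delphi
program ContextualForms;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

type
  // Same order as the four forms of a letter in Presentation Forms-B.
  TJoinForm = (jfIsolated, jfFinal, jfInitial, jfMedial);

// Hypothetical helper: code point of the isolated presentation form for
// the three letters handled by this sketch, or 0 for anything else.
function IsolatedForm(C: Char): Word;
begin
  case Ord(C) of
    $0628: Result := $FE8F; // beh
    $062C: Result := $FE9D; // jeem
    $0644: Result := $FEDD; // lam
  else
    Result := 0;
  end;
end;

// Hypothetical helper: picks the form from the neighbouring letters.
function PresentationForm(const S: string; Index: Integer): Char;
var
  Base: Word;
  JoinsPrev, JoinsNext: Boolean;
  Form: TJoinForm;
begin
  Base := IsolatedForm(S[Index]);
  if Base = 0 then
    Exit(S[Index]); // unknown character: return it unchanged

  JoinsPrev := (Index > 1) and (IsolatedForm(S[Index - 1]) <> 0);
  JoinsNext := (Index < Length(S)) and (IsolatedForm(S[Index + 1]) <> 0);

  if JoinsPrev and JoinsNext then
    Form := jfMedial
  else if JoinsPrev then
    Form := jfFinal
  else if JoinsNext then
    Form := jfInitial
  else
    Form := jfIsolated;

  Result := Char(Base + Ord(Form));
end;

var
  S: string;
  I: Integer;
begin
  S := #$062C#$0628#$0644; // the word from the question
  for I := 1 to Length(S) do
    Writeln(IntToHex(Ord(S[I]), 4), ' -> ',
      IntToHex(Ord(PresentationForm(S, I)), 4));
  // prints 062C -> FE9F, 0628 -> FE92, 0644 -> FEDE
end.
```

For ordinary text storage and display, keeping the basic code points and leaving the shaping to the rendering layer is the simpler choice.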

David Heffernan answered Oct 02 '22 11:10