I was reading about Unicode on Wikipedia (the Arabic Unicode article) and noticed that each of the Arabic digits has two Unicode code points. For example, 1 is defined as both U+0661 and U+06F1. Which one should I use?


In Unicode, why are there two representations for the Arabic digits?

2 Answers

According to the code charts, U+0660 .. U+0669 are ARABIC-INDIC DIGIT values 0 through 9, while U+06F0 .. U+06F9 are EXTENDED ARABIC-INDIC DIGIT values 0 through 9.
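As a quick way to check those names yourself, here is a minimal sketch using Python's standard `unicodedata` module. The helper name `describe_digits` is made up for illustration; the names and values come from the Unicode database bundled with your Python build.

```python
import unicodedata


def describe_digits(base):
    """Return (code point label, character name, digit value) for the ten
    digits starting at code point `base`."""
    rows = []
    for n in range(10):
        ch = chr(base + n)
        rows.append((f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.digit(ch)))
    return rows


if __name__ == "__main__":
    # 0x0660 starts the ARABIC-INDIC block, 0x06F0 the EXTENDED ARABIC-INDIC block.
    for base in (0x0660, 0x06F0):
        for label, name, value in describe_digits(base):
            print(label, name, value)
```

Both series carry the same numeric values; only the character names (and, for some digits, the glyphs) differ.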

In the Unicode 3.0 book (5.2 is the current version, but these things don't change much once set), the U+066n series of glyphs are marked 'Arabic-Indic digits' and the U+06Fn series of glyphs are marked 'Eastern Arabic-Indic digits (Persian and Urdu)'. It also notes:

U+06F4 - 'different glyphs in Persian and Urdu'
U+06F5 - 'Persian and Urdu share glyph different from Arabic'
U+06F6 - 'Persian glyph different from Arabic'
U+06F7 - 'Urdu glyph different from Arabic'

For comparison:

U+066n: ٠١٢٣٤٥٦٧٨٩
U+06Fn: ۰۱۲۳۴۵۶۷۸۹

Or:

     U+066n    U+06Fn
0      ٠         ۰
1      ١         ۱
2      ٢         ۲
3      ٣         ۳
4      ٤         ۴
5      ٥         ۵
6      ٦         ۶
7      ٧         ۷
8      ٨         ۸
9      ٩         ۹

(Whether you can see any of those, and how clearly they are differentiated, may depend on your browser and the fonts installed on your machine as much as anything else. I can see the difference on 4 and 6 clearly; 5 looks much the same in both.)

Based on this information, if you are working with Arabic from the Middle East, use the U+066n series of digits; if you are working with Persian or Urdu, use the U+06Fn series of digits. As a Unicode application, you should accept either set of codes as valid digits (but you might look askance at a sequence that mixed the two sets of digits - or you might just leave well alone).
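To illustrate the "accept either set, but notice mixing" advice, here is a rough sketch in Python. The function names `digit_sets` and `parse_number` and the family labels are invented for this example; it relies on the fact that Python's built-in `int()` accepts Unicode decimal digits, so both series parse to the same value.

```python
ARABIC_INDIC = range(0x0660, 0x066A)           # U+0660 .. U+0669
EXTENDED_ARABIC_INDIC = range(0x06F0, 0x06FA)  # U+06F0 .. U+06F9


def digit_sets(text):
    """Return the set of digit families that appear in `text`."""
    found = set()
    for ch in text:
        cp = ord(ch)
        if cp in ARABIC_INDIC:
            found.add("arabic-indic")
        elif cp in EXTENDED_ARABIC_INDIC:
            found.add("extended-arabic-indic")
        elif ch in "0123456789":
            found.add("ascii")
    return found


def parse_number(text):
    """Parse a digit string from either series.

    Returns (value, mixed), where `mixed` is True if the string combines
    the U+066n and U+06Fn series.
    """
    value = int(text)  # int() understands any Unicode decimal digit
    mixed = {"arabic-indic", "extended-arabic-indic"} <= digit_sets(text)
    return value, mixed
```

What to do when `mixed` is true is a policy decision: you could reject the input, warn, or just leave it alone.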


answered Oct 06 '22 14:10

Jonathan Leffler

In general, you should not hard-code this kind of information in your application.

On Windows, you can use GetLocaleInfo with LOCALE_SNATIVEDIGITS.
On Mac, you can use CFNumberFormatterCopyProperty with kCFNumberFormatterZeroSymbol.
Or use something like ICU.

There are Arabic countries that don't use the Arabic-Indic digits by default. So there is no direct mapping saying Arabic -> Arabic-Indic digits.

And the user might have changed the defaults in the Control Panel anyway.
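As a sketch of how the value returned by those APIs might be applied, the Python below maps ASCII digits onto a set of native digits supplied by the caller. It does not call any platform API itself; `to_native_digits` and `digits_from_zero` are hypothetical helpers, and the assumptions are that the native-digits setting is a ten-character string and that the ten digits follow the zero symbol contiguously (true for the two Arabic series discussed here).

```python
def digits_from_zero(zero):
    """Build a ten-digit string from a zero symbol, assuming the digits
    1-9 follow it contiguously in Unicode."""
    return "".join(chr(ord(zero) + n) for n in range(10))


def to_native_digits(text, native_digits):
    """Replace ASCII digits in `text` with the given native digits.

    `native_digits` is assumed to hold the equivalents of 0-9 in order,
    as obtained from the platform or user settings, not hard-coded.
    """
    if len(native_digits) != 10:
        raise ValueError("expected exactly ten native digits")
    return text.translate(str.maketrans("0123456789", native_digits))
```

For example, `to_native_digits("2024", digits_from_zero("\u0660"))` yields the Arabic-Indic form, while passing `"\u06F0"` as the zero symbol yields the extended (Persian and Urdu) form.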

answered Oct 06 '22 14:10

Mihai Nita

Karim
