
In Unicode, why are there two representations for the Arabic digits?

I was reading about Unicode on Wikipedia (the Arabic Unicode article) and I see that each of the Arabic digits has two Unicode code points. For example, 1 is defined both as U+0661 and as U+06F1.

Which one should I use?

Karim asked Nov 04 '09 20:11



2 Answers

According to the code charts, U+0660 .. U+0669 are ARABIC-INDIC DIGIT values 0 through 9, while U+06F0 .. U+06F9 are EXTENDED ARABIC-INDIC DIGIT values 0 through 9.
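If you don't have the code charts to hand, the character names can also be checked programmatically. Here is a minimal sketch using Python's standard `unicodedata` module; the helper name `digit_names` is my own and purely illustrative:

```python
import unicodedata


def digit_names(first_code_point):
    """Return (character, name, decimal value) for the ten digits
    starting at first_code_point."""
    rows = []
    for offset in range(10):
        ch = chr(first_code_point + offset)
        rows.append((ch, unicodedata.name(ch), unicodedata.decimal(ch)))
    return rows


# Print both blocks: U+0660..U+0669 and U+06F0..U+06F9.
for start in (0x0660, 0x06F0):
    for ch, name, value in digit_names(start):
        print(f"U+{ord(ch):04X} {name} = {value}")
```

Both blocks report decimal values 0 through 9; only the names (and, for some digits, the glyphs) differ.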

In the Unicode 3.0 book (5.2 is the current version, but these things don't change much once set), the U+066n series of glyphs is marked 'Arabic-Indic digits' and the U+06Fn series is marked 'Eastern Arabic-Indic digits (Persian and Urdu)'. It also notes:

  • U+06F4 - 'different glyphs in Persian and Urdu'
  • U+06F5 - 'Persian and Urdu share glyph different from Arabic'
  • U+06F6 - 'Persian glyph different from Arabic'
  • U+06F7 - 'Urdu glyph different from Arabic'

For comparison:

  • U+066n: ٠١٢٣٤٥٦٧٨٩
  • U+06Fn: ۰۱۲۳۴۵۶۷۸۹

Or:

         U+066n    U+06Fn
    0      ٠         ۰
    1      ١         ۱
    2      ٢         ۲
    3      ٣         ۳
    4      ٤         ۴
    5      ٥         ۵
    6      ٦         ۶
    7      ٧         ۷
    8      ٨         ۸
    9      ٩         ۹

(Whether you can see any of those, and how clearly they are differentiated, may depend on your browser and the fonts installed on your machine as much as anything else. I can see the difference on 4 and 6 clearly; 5 looks much the same in both.)

Based on this information, if you are working with Arabic from the Middle East, use the U+066n series of digits; if you are working with Persian or Urdu, use the U+06Fn series of digits. As a Unicode application, you should accept either set of codes as valid digits (but you might look askance at a sequence that mixed the two sets of digits - or you might just leave well alone).
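As a rough illustration of "accept either set, but notice mixing", here is a sketch in Python. It relies on the fact that Python's built-in `int()` accepts any Unicode decimal digits; the function names `digit_sets` and `parse_number` and the family labels are my own assumptions, not part of any standard API:

```python
ARABIC_INDIC = range(0x0660, 0x066A)           # U+0660..U+0669
EXTENDED_ARABIC_INDIC = range(0x06F0, 0x06FA)  # U+06F0..U+06F9


def digit_sets(text):
    """Return the set of digit families that occur in text."""
    found = set()
    for ch in text:
        cp = ord(ch)
        if cp in ARABIC_INDIC:
            found.add("arabic-indic")
        elif cp in EXTENDED_ARABIC_INDIC:
            found.add("extended-arabic-indic")
        elif "0" <= ch <= "9":
            found.add("ascii")
    return found


def parse_number(text):
    """Parse a digit string and report whether it mixes digit families."""
    value = int(text)  # int() understands any Unicode decimal digit
    mixed = len(digit_sets(text)) > 1
    return value, mixed


print(parse_number("\u0661\u0662\u0663"))  # Arabic-Indic 123   -> (123, False)
print(parse_number("\u06f1\u06f2\u06f3"))  # Extended 123       -> (123, False)
print(parse_number("\u0661\u06f2\u0663"))  # mixed families     -> (123, True)
```

What to do with the `mixed` flag (reject, warn, or ignore) is an application decision; the sketch only surfaces it.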

Jonathan Leffler answered Oct 06 '22 14:10


In general you should not hard-code such info in your application.

  • On Windows you can use GetLocaleInfo with LOCALE_SNATIVEDIGITS.
  • On Mac you can use CFNumberFormatterCopyProperty with kCFNumberFormatterZeroSymbol.
  • Or use something like ICU.

There are Arabic countries that don't use the Arabic-Indic digits by default. So there is no direct mapping saying Arabic -> Arabic-Indic digits.

And the user might have changed the defaults in the Control Panel anyway.
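To show what "not hard-coding" can look like, here is a small Python sketch in which the digit characters are passed in rather than baked into the formatting code. It assumes the platform hands you a ten-character string of native digits ordered from zero to nine (which is my understanding of what LOCALE_SNATIVEDIGITS returns); fetching that string from the OS is not shown, and the name `localize_digits` is hypothetical:

```python
def localize_digits(number, native_digits):
    """Render number using a ten-character native digit string
    (index 0 is the zero digit, index 9 is the nine digit)."""
    if len(native_digits) != 10:
        raise ValueError("expected exactly ten digit characters")
    table = str.maketrans("0123456789", native_digits)
    return str(number).translate(table)


# Stand-in values for what the platform API might return (assumption).
arabic_indic = "".join(chr(0x0660 + i) for i in range(10))
european = "0123456789"

print(localize_digits(2009, arabic_indic))
print(localize_digits(2009, european))
```

The same call works unchanged whether the user's settings select Arabic-Indic, Extended Arabic-Indic, or European digits, which is the point of asking the system instead of guessing from the language.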

Mihai Nita answered Oct 06 '22 14:10