Swift: string.characters.count returns wrong number for an Arabic string

Tags:

string

ios

swift

I have the following text written in Arabic, and when I call text.characters.count it returns 298 characters instead of the real number, which is 300.

The text:

هنالك العديد من الأنواع المتوفرة لنصوص لوريم إيبسوم، ولكن الغالبية تم تعديلها بشكل ما عبر إدخال بعض النوادر أو الكلمات العشوائية إلى النص. إن كنت تريد أن تستخدم نص لوريم إيبسوم ما، عليك أن تتحقق أولاً أن ليس هناك أي كلمات أو عبارات محرجة أو غير لائقة مخبأة في هذا النص. بينما تعمل جميع مولّدات نصوص ا

Note that there is no surrounding whitespace before or after the text.

utf8.characters.count also returns the same wrong number.

How can I obtain the right number of characters for such a string?

JAHelia asked Aug 24 '17 09:08


2 Answers

Getting the Unicode scalar count should give you the expected result:

let myString = "هنالك العديد من الأنواع المتوفرة لنصوص لوريم إيبسوم، ولكن الغالبية تم تعديلها بشكل ما عبر إدخال بعض النوادر أو الكلمات العشوائية إلى النص. إن كنت تريد أن تستخدم نص لوريم إيبسوم ما، عليك أن تتحقق أولاً أن ليس هناك أي كلمات أو عبارات محرجة أو غير لائقة مخبأة في هذا النص. بينما تعمل جميع مولّدات نصوص ا"

myString.unicodeScalars.count // 300

As mentioned in the "Strings and Characters" chapter of The Swift Programming Language:

Behind the scenes, Swift’s native String type is built from Unicode scalar values. A Unicode scalar is a unique 21-bit number for a character or modifier, such as U+0061 for LATIN SMALL LETTER A ("a"), or U+1F425 for FRONT-FACING BABY CHICK ("🐥").
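To make the scalar view concrete, here is a small sketch that lists the scalars of the two characters named in the quote. It assumes Swift 4 or later, and the helper name codePointLabel is made up for illustration; it is not part of the standard library.

```swift
// Hypothetical helper: formats a scalar in the usual U+XXXX notation.
func codePointLabel(_ scalar: Unicode.Scalar) -> String {
    let hex = String(scalar.value, radix: 16, uppercase: true)
    let padding = String(repeating: "0", count: max(0, 4 - hex.count))
    return "U+" + padding + hex
}

// "a" followed by the baby chick emoji from the quote above.
let sample = "a\u{1F425}"
let labels = sample.unicodeScalars.map(codePointLabel)

print(labels)              // ["U+0061", "U+1F425"]
print(sample.utf16.count)  // 3: the emoji needs a surrogate pair in UTF-16
print(sample.utf8.count)   // 5: 1 byte for "a", 4 bytes for the emoji
```

The same two scalars take a different number of code units in each encoding, which is why the different views of a string can disagree about its length.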

However, regardless of the result you expect, counting the "harakat" (diacritics) such as "fat-ha", "damma", and "kasra" as separate characters probably gives the wrong result.

For instance, if you check the count of the word "أولاً", you will notice that:

let myString = "أولاً"

myString.characters.count // 4
myString.unicodeScalars.count // 5

As you can see, the tanween fat-ha is not counted as a separate character unless you count the string's unicodeScalars.
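The grouping can be inspected directly. The sketch below is an illustration rather than part of the original answer: it assumes Swift 5 or later (for Unicode.Scalar.Properties) and spells the word out with explicit scalar escapes, placing the tanween after the final alef, so the combining mark is visible in the source.

```swift
// "أولاً" written scalar by scalar: alef-hamza, waw, lam, alef, fathatan.
let word = "\u{0623}\u{0648}\u{0644}\u{0627}\u{064B}"

// For every Character (grapheme cluster), list the scalars it is built from.
let clusters = word.map { character in
    character.unicodeScalars.map { $0.value }
}
print(clusters.count)   // 4
print(clusters[3])      // [1575, 1611], i.e. U+0627 followed by U+064B

// Count the scalars that Unicode classifies as nonspacing marks (harakat fall in this category).
let markCount = word.unicodeScalars.filter {
    $0.properties.generalCategory == .nonspacingMark
}.count
print(markCount)                                // 1
print(word.unicodeScalars.count - markCount)    // 4, matching the Character count for this word
```

For this word, the difference between the scalar count and the Character count is exactly the number of combining marks.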

As you mentioned, charactercountonline.com seems to count the "harakat" (diacritics) as independent characters. That may look logical to people who do not speak Arabic, but it gives the wrong count.


Remark for non-Arabic-speaking readers:

The word "أولاً" contains a diacritical mark called "fat-hatan" or "tanween fat-h". According to Arabic grammar, this mark should not be counted as a separate character; its purpose is to indicate how the word is pronounced. To an Arabic speaker it is obvious that "أولاً" contains four characters, but it is not obvious to a computer counting code points!

Ahmad F answered Sep 28 '22 03:09


[...] when I call text.characters.count it returns 298 characters instead of the real number, which is 300.

It all boils down to the definition of a character (of which there are several).

Swift's definition differs somewhat from that of most other programming languages: it defines a character as a "single extended grapheme cluster":

An extended grapheme cluster is a sequence of one or more Unicode scalars that (when combined) produce a single human-readable character.
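A short illustration of that definition, assuming Swift 4 or later (where String is itself a collection of Character values): the letter "é" can be stored as one precomposed scalar or as "e" plus a combining accent, and both forms count as a single Character.

```swift
let precomposed = "\u{E9}"     // é as a single scalar, U+00E9
let decomposed = "e\u{301}"    // e followed by U+0301 COMBINING ACUTE ACCENT

print(precomposed.count, precomposed.unicodeScalars.count)  // 1 1
print(decomposed.count, decomposed.unicodeScalars.count)    // 1 2

// Swift compares strings by canonical equivalence, so the two spellings are equal.
print(precomposed == decomposed)                            // true
```

The Arabic diacritics behave like the combining accent here: they attach to the preceding letter and do not add to the Character count.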

So when working with a "character count", it's important to think about what you actually want to know: is it what a human would perceive as a character, or is it the length in some (computer) encoding?

Without a proper definition there is no "correct" answer.
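One way to keep that definition explicit in code is to make the caller name the unit being counted. The following is only a sketch under that assumption; the CountingUnit enum and the length(of:in:) function are hypothetical names, not an existing API, and the code assumes Swift 4 or later.

```swift
// Hypothetical type: the definitions of "character" a caller may mean.
enum CountingUnit {
    case graphemeCluster   // what a reader perceives as one character
    case unicodeScalar     // code points, excluding surrogates
    case utf16CodeUnit     // what NSString's length reports
    case utf8Byte          // size of the UTF-8 encoding in bytes
}

// Hypothetical helper: returns the length of `text` in the requested unit.
func length(of text: String, in unit: CountingUnit) -> Int {
    switch unit {
    case .graphemeCluster: return text.count
    case .unicodeScalar:   return text.unicodeScalars.count
    case .utf16CodeUnit:   return text.utf16.count
    case .utf8Byte:        return text.utf8.count
    }
}

// "أولاً" written with explicit scalar escapes (tanween after the final alef).
let arabicWord = "\u{0623}\u{0648}\u{0644}\u{0627}\u{064B}"
print(length(of: arabicWord, in: .graphemeCluster))  // 4
print(length(of: arabicWord, in: .unicodeScalar))    // 5
print(length(of: arabicWord, in: .utf16CodeUnit))    // 5
print(length(of: arabicWord, in: .utf8Byte))         // 10
```

A limit such as "300 characters" would then be checked against whichever unit the requirement actually refers to, for example the scalar count if it has to match a tool that counts diacritics separately.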

Nikolai Ruhe answered Sep 28 '22 02:09