I have the following text written in Arabic, and when I call text.characters.count
it returns 298 characters instead of the real number, which is 300.
The text:
هنالك العديد من الأنواع المتوفرة لنصوص لوريم إيبسوم، ولكن الغالبية تم تعديلها بشكل ما عبر إدخال بعض النوادر أو الكلمات العشوائية إلى النص. إن كنت تريد أن تستخدم نص لوريم إيبسوم ما، عليك أن تتحقق أولاً أن ليس هناك أي كلمات أو عبارات محرجة أو غير لائقة مخبأة في هذا النص. بينما تعمل جميع مولّدات نصوص ا
Note that there is no surrounding whitespace before or after the text.
utf8.characters.count
also returns the same wrong number.
How can I obtain the correct number of characters for such a string?
Getting the Unicode scalar count should give you the expected result:
let myString = "هنالك العديد من الأنواع المتوفرة لنصوص لوريم إيبسوم، ولكن الغالبية تم تعديلها بشكل ما عبر إدخال بعض النوادر أو الكلمات العشوائية إلى النص. إن كنت تريد أن تستخدم نص لوريم إيبسوم ما، عليك أن تتحقق أولاً أن ليس هناك أي كلمات أو عبارات محرجة أو غير لائقة مخبأة في هذا النص. بينما تعمل جميع مولّدات نصوص ا"
myString.unicodeScalars.count // 300
As mentioned in the Swift documentation on Strings and Characters:
Behind the scenes, Swift’s native String type is built from Unicode scalar values. A Unicode scalar is a unique 21-bit number for a character or modifier, such as U+0061 for LATIN SMALL LETTER A ("a"), or U+1F425 for FRONT-FACING BABY CHICK ("🐥").
However
Regardless of which result you expect, counting the "harakat" (diacritics) such as "Fat-ha", "damma", and "kasra" as separate characters probably gives the wrong result.
For instance, if you check the count of the word "أولاً", you will notice that:
let myString = "أولاً"
myString.characters.count // 4
myString.unicodeScalars.count // 5
As you can see, the Tanween Fat-ha mark is not counted as a separate character unless you count the string's unicodeScalars.
As you mentioned, it seems that charactercountonline.com counts the "harakat" (diacritics) as independent characters, which may seem logical to people who do not speak Arabic, but it is the wrong way to count.
Remark for non-Arabic-speaking readers:
The word "أولاً" contains a diacritic called "Fat-hatan" or "Tanween Fat-h". According to Arabic grammar, this mark should not be counted as a separate character; its purpose is to indicate how the word is pronounced. To an Arabic speaker it is obvious that the word "أولاً" contains four characters, but it is not obvious to a computer when it comes to counting!
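To make the difference concrete, here is a minimal sketch that counts the diacritics directly. It assumes Swift 5 or later (for `Unicode.Scalar.Properties`), and `combiningMarkCount(in:)` is a hypothetical helper, not a standard library function. The word is spelled out as explicit scalars so the result does not depend on how the text was stored.

```swift
// Hypothetical helper (not part of the standard library): counts the scalars
// whose Unicode general category is "nonspacing mark" (Mn). The Arabic
// harakat, including the fathatan U+064B, fall into this category.
func combiningMarkCount(in text: String) -> Int {
    return text.unicodeScalars
        .filter { $0.properties.generalCategory == .nonspacingMark }
        .count
}

// The word "أولاً" as explicit scalars: alef with hamza, waw, lam, alef, fathatan.
let word = "\u{0623}\u{0648}\u{0644}\u{0627}\u{064B}"

let marks = combiningMarkCount(in: word)   // 1, the fathatan
let scalars = word.unicodeScalars.count    // 5
let characters = word.count                // 4 (text.characters.count in Swift 3)
```

For this word the scalar count minus the number of marks equals the character count. That relation is not guaranteed for arbitrary text, since grapheme clusters can also be formed in other ways.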
[...] when I call text.characters.count it returns 298 characters instead of the real number, which is 300.
It all boils down to the definition of a character (of which there are several).
Swift's definition differs somewhat from that of most other programming languages, as it defines a character as a "single extended grapheme cluster":
An extended grapheme cluster is a sequence of one or more Unicode scalars that (when combined) produce a single human-readable character.
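As an illustration of that definition, the following sketch (assuming Swift 4 or later, where `String` has a `count` property) builds the same visible letter from one scalar and from two scalars. The variable names are only for illustration.

```swift
// "é" as a single precomposed scalar (U+00E9).
let precomposed = "\u{E9}"
// "é" as "e" (U+0065) followed by COMBINING ACUTE ACCENT (U+0301).
let decomposed = "\u{65}\u{301}"

// Both are one extended grapheme cluster, so both count as one Character.
let precomposedCharacters = precomposed.count              // 1
let decomposedCharacters = decomposed.count                // 1

// The scalar views differ.
let precomposedScalars = precomposed.unicodeScalars.count  // 1
let decomposedScalars = decomposed.unicodeScalars.count    // 2

// Swift compares strings by canonical equivalence, so the two are equal.
let sameText = (precomposed == decomposed)                 // true
```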
So when working with a "character count", it's important to think about what you actually want to know: is it what a human would perceive as a character, or is it about some (computer) encoding?
Without a proper definition there is no "correct" answer.
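A small sketch of how the answer changes with the definition, assuming Swift 4 or later. It uses the word "أولاً" from the earlier answer, written as explicit scalars; the variable names are only for illustration.

```swift
// The word "أولاً": U+0623, U+0648, U+0644, U+0627, U+064B.
let sample = "\u{0623}\u{0648}\u{0644}\u{0627}\u{064B}"

// What a reader perceives: extended grapheme clusters.
let graphemeCount = sample.count                  // 4
// Unicode code points (scalars).
let scalarCount = sample.unicodeScalars.count     // 5
// UTF-16 code units (what NSString's length reports).
let utf16Count = sample.utf16.count               // 5
// Bytes when encoded as UTF-8; each of these scalars takes two bytes.
let utf8Count = sample.utf8.count                 // 10
```

Which of these is "right" depends on the use: the grapheme count suits anything shown to a reader, while the UTF-8 or UTF-16 counts matter for storage or for APIs that limit length in code units.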