regex for accepting only persian characters

Question

I'm working on a form where one of its custom validators should only accept Persian characters. I used the following code:

    var myregex = new Regex(@"^[\u0600-\u06FF]+$");
    args.IsValid = myregex.IsMatch(mytextBox.Text);

However, it seems that it can only detect Arabic characters, as it doesn't cover all Persian characters (it lacks these four: گ,چ,پ,ژ ).

Is there a way to solve this problem?

revo · Accepted Answer

TL;DR

The character sets that Farsi must use are as follows:

Use ^[آابپتثجچحخدذرزژسشصضطظعغفقکگلمنوهی]+$ for letters, or use codepoints depending on your regex flavor (not all engines support the \uXXXX notation):
```
^[\u0622\u0627\u0628\u067E\u062A-\u062C\u0686\u062D-\u0632\u0698\u0633-\u063A\u0641\u0642\u06A9\u06AF\u0644-\u0648\u06CC]+$ 
```
Use ^[۰۱۲۳۴۵۶۷۸۹]+$ for digits or, depending on your regex flavor, the codepoints:
```
^[\u06F0-\u06F9]+$ 
```
Use [ ‬ٌ ‬ًّ ‬َ ‬ِ ‬ُ ‬ْ ‬] for vowels or, depending on your regex flavor, the codepoints:
```
[\u202C\u064B\u064C\u064E-\u0652] 
```

or use a combination of these sets. You may also want to add other Arabic letters, such as Hamza (ء), to your character set.
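As a minimal sketch of how these sets could be combined in the asker's C# validator: the class name `PersianText` and the method `IsPersianOnly` are hypothetical, and the choice to allow letters, digits, and vowel marks together (but not spaces or `\u202C`) is an assumption you should adjust to your own form.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper; the names are illustrative, not from the original post.
public static class PersianText
{
    // Letters, Persian digits, and vowel marks from the sets listed above,
    // merged into one character class. .NET's regex engine understands \uXXXX.
    // Assumption: \u202C (a directional formatting character) is left out.
    private static readonly Regex PersianOnly = new Regex(
        @"^[\u0622\u0627\u0628\u067E\u062A-\u062C\u0686\u062D-\u0632\u0698" +
        @"\u0633-\u063A\u0641\u0642\u06A9\u06AF\u0644-\u0648\u06CC" + // letters
        @"\u06F0-\u06F9" +                                             // digits
        @"\u064B\u064C\u064E-\u0652]+$");                              // vowels

    public static bool IsPersianOnly(string input)
    {
        // Null and empty input are rejected; the class requires one or more characters.
        return input != null && PersianOnly.IsMatch(input);
    }
}
```

In the validator from the question this would be used as `args.IsValid = PersianText.IsPersianOnly(mytextBox.Text);`. Note that in .NET `$` also matches just before a trailing newline; replace it with `\z` if that matters for your input.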

Why are `[\u0600-\u06FF]` and `[آ-ی]` both wrong?

Although `\u0600-\u06FF` includes:

گ with codepoint 06AF
چ with codepoint 0686
پ with codepoint 067E
ژ with codepoint 0698

as well, all answers that suggest `[\u0600-\u06FF]` or `[آ-ی]` are simply WRONG.

That is, \u0600-\u06FF contains 209 more characters than you need, and it includes digits too!
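To make the difference concrete, here is a small hedged sketch that runs the same input through the broad block range and through the letter set from the TL;DR. The class and method names (`RangeComparison`, `BroadAccepts`, `StrictAccepts`) are made up for this illustration.

```csharp
using System.Text.RegularExpressions;

// Hypothetical comparison helper; not part of the original answer.
public static class RangeComparison
{
    // The whole Arabic block, as used in the question.
    private static readonly Regex Broad = new Regex(@"^[\u0600-\u06FF]+$");

    // The Farsi letter set from the TL;DR above.
    private static readonly Regex Strict = new Regex(
        @"^[\u0622\u0627\u0628\u067E\u062A-\u062C\u0686\u062D-\u0632\u0698" +
        @"\u0633-\u063A\u0641\u0642\u06A9\u06AF\u0644-\u0648\u06CC]+$");

    public static bool BroadAccepts(string s) => s != null && Broad.IsMatch(s);
    public static bool StrictAccepts(string s) => s != null && Strict.IsMatch(s);
}
```

For inputs such as `"\u0643"` (Arabic kaf ك), `"\u064A"` (Arabic yeh ي), `"\u0667"` (Arabic-Indic digit ٧), or `"\u060C"` (Arabic comma ،), `BroadAccepts` returns true while `StrictAccepts` returns false. Both accept the four letters گ, چ, پ, and ژ.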

Whole story

This answer exists to fix a common misconception. Codepoints 0600 through 06FF do not denote the Persian / Farsi alphabet (and neither does [آ-ی]):

[\u0600-\u0605 ؐ-ؚ\u061Cـ ۖ-\u06DD ۟-ۤ ۧ ۨ ۪-ۭ ً-ٕ ٟ ٖ-ٞ ٰ ، ؍ ٫ ٬ ؛ ؞ ؟ ۔ ٭ ٪ ؉ ؊ ؈ ؎ ؏ ۞ ۩ ؆ ؇ ؋ ٠۰ ١۱ ٢۲ ٣۳ ٤۴ ٥۵ ٦۶ ٧۷ ٨۸ ٩۹ ءٴ۽ آ أ ٲ ٱ ؤ إ ٳ ئ ا ٵ ٮ ب ٻ پ ڀ ة-ث ٹ ٺ ټ ٽ ٿ ج ڃ ڄ چ ڿ ڇ ح خ ځ ڂ څ د ذ ڈ-ڐ ۮ ر ز ڑ-ڙ ۯ س ش ښ-ڜ ۺ ص ض ڝ ڞ ۻ ط ظ ڟ ع غ ڠ ۼ ف ڡ-ڦ ٯ ق ڧ ڨ ك ک-ڴ ػ ؼ ل ڵ-ڸ م۾ ن ں-ڽ ڹ ه ھ ہ-ۃ ۿ ەۀ وۥ ٶ ۄ-ۇ ٷ ۈ-ۋ ۏ ى يۦ ٸ ی-ێ ې ۑ ؽ-ؿ ؠ ے ۓ \u061D]

The Arabic block (0600–06FF) contains 255 characters. The Farsi alphabet has 32 letters; adding the Farsi forms of the ten digits gives 42. If we also add the vowels (originally Arabic vowels, which are rarely used in Farsi) but leave out Tanvin (ً, ٍِ ‬, ٌ ‬) and Tashdid (ّ ‬), which are both subsets of Arabic diacritics rather than Farsi, we end up with 46 characters. This means \u0600-\u06FF contains 209 more characters than you need (255 − 46 = 209)!
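If you want to check a tally like this yourself, one rough approach is to walk the block and count how many codepoints a given character class accepts. The helper below is a sketch with hypothetical names (`BlockTally`, `CountInArabicBlock`); the exact totals depend on counting conventions, for example whether آ is counted separately from ا.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class BlockTally
{
    // Counts the codepoints in U+0600..U+06FF that the given
    // character class (for example @"[\u06F0-\u06F9]") accepts.
    public static int CountInArabicBlock(string characterClass)
    {
        var regex = new Regex("^" + characterClass + "$");
        int count = 0;
        for (int cp = 0x0600; cp <= 0x06FF; cp++)
        {
            // Every codepoint in this block fits in a single UTF-16 char.
            if (regex.IsMatch(((char)cp).ToString()))
            {
                count++;
            }
        }
        return count;
    }
}
```

Calling it with the digit class `@"[\u06F0-\u06F9]"` gives 10. Calling it with the letter class from the TL;DR gives 33 rather than 32, because that class lists آ (0622) and ا (0627) as separate codepoints.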

۷ with codepoint 06F7 is the Farsi representation of the number 7, and ٧ with codepoint 0667 is the Arabic representation of the same number. Likewise, ۶ is the Farsi representation of the number 6 and ٦ is the Arabic one. All of them reside in codepoints 0600 through 06FF.

The shapes of the Persian digits four (۴), five (۵), and six (۶) differ from the shapes used in Arabic, and the remaining digits, although they look the same, still have different codepoints.
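The digit distinction can be sketched in C# as follows. The names `DigitCheck`, `IsPersianDigits`, and `IsAnyDigits` are hypothetical. The second method is included only to show why `\d` is not a substitute: in .NET it matches any Unicode decimal digit by default.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of the original answer.
public static class DigitCheck
{
    // Only the Persian digits U+06F0..U+06F9.
    private static readonly Regex PersianDigits = new Regex(@"^[\u06F0-\u06F9]+$");

    // \d accepts Arabic-Indic and ASCII digits as well as Persian ones.
    private static readonly Regex AnyDigits = new Regex(@"^\d+$");

    public static bool IsPersianDigits(string s) => s != null && PersianDigits.IsMatch(s);
    public static bool IsAnyDigits(string s) => s != null && AnyDigits.IsMatch(s);
}
```

Here `IsPersianDigits("\u06F7")` (۷) is true, while `IsPersianDigits("\u0667")` (٧) and `IsPersianDigits("7")` are false; `IsAnyDigits` accepts all three.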

You can also see a number of other characters that do not exist in Farsi / Persian at all, and nobody wants to accept them when validating a first name or surname.

[آ-ی] is also far too broad: it spans codepoints 0622 through 06CC, which is 171 codepoints in total and much more than anyone needs for validation. You can see them all using Unicode CLDR.
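Because [آ-ی] is an ordinary codepoint interval, its size and some of its unwanted members can be checked directly. The sketch below uses hypothetical names (`AlefToYehRange`, `Size`, `Accepts`) and writes the range with escapes so the endpoints are unambiguous.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class AlefToYehRange
{
    public const char First = '\u0622'; // آ
    public const char Last = '\u06CC';  // ی

    // Number of codepoints in the interval, both ends included.
    public static int Size => Last - First + 1;

    // Same range as [آ-ی], written with escapes.
    private static readonly Regex Range = new Regex(@"^[\u0622-\u06CC]+$");

    public static bool Accepts(string s) => s != null && Range.IsMatch(s);
}
```

`Accepts` returns true for `"\u0643"` (Arabic kaf ك), `"\u064A"` (Arabic yeh ي), `"\u0667"` (Arabic-Indic digit ٧), and `"\u064B"` (the Tanvin mark), none of which belong in a Farsi-only name field.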

Tags: c#, regex, asp.net, unicode

Sara NikitaUsefi
