
Check if a string contains characters other than Persian/Arabic characters in Python

Tags:

python

Is there a way to check if a string contains any non-Arabic, non-Persian characters in Python?

Navid777 asked Jan 28 '23 11:01


1 Answer

I don't know of any libraries that classify Unicode codepoints into scripts.

You could search the name of the block, or the name of the character; something like this:

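import unicodedata

# The '' default avoids a ValueError for code points that have no name.
name = unicodedata.name(ch, '').lower()
if 'arabic' in name or 'persian' in name:
    pass  # handle the Arabic/Persian character here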

But that's pretty hacky. For example, that will include things like the Old Persian script, but not Rumi numerals, and I suspect that if you want one of those, you also want the other.
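
To see that mismatch concretely, here is a small sketch of the name-based check wrapped in a function. The function name looks_arabic_by_name is made up for illustration; only unicodedata.name comes from the standard library.

```python
import unicodedata

def looks_arabic_by_name(ch):
    """Hypothetical helper: guess the script from the character's Unicode name."""
    # Pass '' as the default so unnamed code points don't raise ValueError.
    name = unicodedata.name(ch, '').lower()
    return 'arabic' in name or 'persian' in name

print(looks_arabic_by_name('\u0633'))      # ARABIC LETTER SEEN
print(looks_arabic_by_name('\U000103A0'))  # OLD PERSIAN SIGN A (cuneiform)
print(looks_arabic_by_name('\U00010E60'))  # RUMI DIGIT ONE
```

The first two calls return True and the third returns False, which is exactly the "Old Persian in, Rumi numerals out" behavior described above.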

So, what you really want to do is look at the Unicode standard to see all the blocks that contain Arabic and Persian glyphs, and decide which ones you do and don't want to include.

Or, for a shortcut: Wikipedia has an article on Arabic script in Unicode. If you're looking at this answer from the future, you may want to verify that this is up-to-date for Unicode 23.0 with the new Space Persian letters and all that, but as of today, it looks up to date. So, I'm going to copy from there:

  • Arabic (0600–06FF, 255 characters)
  • Arabic Supplement (0750–077F, 48 characters)
  • Arabic Extended-A (08A0–08FF, 73 characters)
  • Arabic Presentation Forms-A (FB50–FDFF, 611 characters)
  • Arabic Presentation Forms-B (FE70–FEFF, 141 characters)
  • Rumi Numeral Symbols (10E60–10E7F, 31 characters)
  • Arabic Mathematical Alphabetic Symbols (1EE00–1EEFF, 143 characters)

If you wanted all of those, you'd just do this:

if ('\u0600' <= ch <= '\u06FF' or
    '\u0750' <= ch <= '\u077F' or
    '\u08A0' <= ch <= '\u08FF' or
    '\uFB50' <= ch <= '\uFDFF' or
    '\uFE70' <= ch <= '\uFEFF' or
    '\U00010E60' <= ch <= '\U00010E7F' or
    '\U0001EE00' <= ch <= '\U0001EEFF'):
    pass  # ch is in one of the Arabic-script blocks listed above

Of course I doubt you want all of those, but it should be obvious how to modify it to match the ones you do want.

And you're probably going to want to allow some other characters that aren't Arabic or Persian. For example, you might add "or ch.isspace()" to the condition, or another range check, or a character class check.
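
Putting the pieces together for the original question, here is one possible sketch of a whole-string check. The names BLOCKS, contains_non_arabic, and also_allow are invented for this example; the ranges are the ones copied from the list above, and treating whitespace as acceptable is just an assumed default that you can change.

```python
# Block ranges copied from the list above; trim this dict to the blocks you want.
BLOCKS = {
    'Arabic': (0x0600, 0x06FF),
    'Arabic Supplement': (0x0750, 0x077F),
    'Arabic Extended-A': (0x08A0, 0x08FF),
    'Arabic Presentation Forms-A': (0xFB50, 0xFDFF),
    'Arabic Presentation Forms-B': (0xFE70, 0xFEFF),
    'Rumi Numeral Symbols': (0x10E60, 0x10E7F),
    'Arabic Mathematical Alphabetic Symbols': (0x1EE00, 0x1EEFF),
}

def contains_non_arabic(text, blocks=BLOCKS, also_allow=str.isspace):
    """Return True if some character is outside the chosen blocks
    and is not accepted by the optional also_allow predicate."""
    ranges = list(blocks.values())
    for ch in text:
        cp = ord(ch)
        if any(lo <= cp <= hi for lo, hi in ranges):
            continue  # inside one of the selected blocks
        if also_allow is not None and also_allow(ch):
            continue  # explicitly allowed extra character (whitespace by default)
        return True
    return False

print(contains_non_arabic('\u0633\u0644\u0627\u0645 \u062f\u0646\u06cc\u0627'))  # Persian words + space
print(contains_non_arabic('\u0633\u0644\u0627\u0645 world'))                      # mixed with Latin
```

Passing a smaller dict as blocks, or a different predicate as also_allow (or None to allow nothing extra), adjusts what counts as acceptable without touching the loop.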

And if you want to turn this into a regex, you should be able to figure out how to write a regex character class containing the ranges you want. (If not, you shouldn't be using regex.)
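
For reference, a regex version could look roughly like the sketch below, assuming Python 3 (where \U escapes and astral ranges work in re patterns). The names NON_ARABIC and contains_non_arabic_re are invented for this example, and allowing \s is an assumption you may want to drop.

```python
import re

# Negated character class: matches any single character that is NOT in one of
# the Arabic-script ranges listed above and is NOT whitespace.
NON_ARABIC = re.compile(
    '[^'
    '\u0600-\u06FF'
    '\u0750-\u077F'
    '\u08A0-\u08FF'
    '\uFB50-\uFDFF'
    '\uFE70-\uFEFF'
    '\U00010E60-\U00010E7F'
    '\U0001EE00-\U0001EEFF'
    r'\s'
    ']'
)

def contains_non_arabic_re(text):
    """Hypothetical helper: True if the regex finds a disallowed character."""
    return NON_ARABIC.search(text) is not None

print(contains_non_arabic_re('\u0633\u0644\u0627\u0645'))        # Arabic letters only
print(contains_non_arabic_re('\u0633\u0644\u0627\u0645 hello'))  # contains Latin letters
```

As with the if-chain, delete the ranges you don't want and add whatever extra classes you do.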

abarnert answered Feb 19 '23 03:02