Is there a way to check if a string contains any non-Arabic, non-Persian characters in Python?
I don't know of any libraries that classify Unicode codepoints into scripts.
You could search the name of the block, or the name of the character; something like this:
import unicodedata

# Pass a default so unnamed code points give '' instead of raising ValueError.
name = unicodedata.name(ch, '').lower()
if 'arabic' in name or 'persian' in name:
    # ...
But that's pretty hacky. For example, that will include things like the Old Persian script, but not Rumi numerals, and I suspect that if you want one of those, you also want the other.
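To make that mismatch concrete, here is a small sketch of the name-based test applied to three sample characters. The helper name looks_arabic_by_name and the choice of samples are mine, purely for illustration; the exact names come from whatever Unicode database ships with your Python.

```python
import unicodedata

def looks_arabic_by_name(ch):
    """Name-based heuristic; a hypothetical helper, not a library function."""
    # Unnamed code points (e.g. control characters) yield '' rather than ValueError.
    name = unicodedata.name(ch, '').lower()
    return 'arabic' in name or 'persian' in name

# Sample characters chosen for illustration only.
samples = {
    'Arabic letter beh': '\u0628',
    'Old Persian sign A': '\U000103A0',
    'Rumi digit one': '\U00010E60',
}
for label, ch in samples.items():
    print(label, looks_arabic_by_name(ch))
```

The Old Persian cuneiform sign passes because its name contains "PERSIAN", while the Rumi digit fails because its name mentions neither word.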
So, what you really want to do is look at the Unicode standard to see all the blocks that contain Arabic and Persian glyphs, and decide which ones you do and don't want to include.
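Or, for a shortcut: Wikipedia has an article on Arabic script in Unicode. If you're looking at this answer from the future, you may want to verify that this is up-to-date for Unicode 23.0 with the new Space Persian letters and all that, but as of today, it looks up to date. So, I'm going to copy from there:
Arabic (0600–06FF)
Arabic Supplement (0750–077F)
Arabic Extended-A (08A0–08FF)
Arabic Presentation Forms-A (FB50–FDFF)
Arabic Presentation Forms-B (FE70–FEFF)
Rumi Numeral Symbols (10E60–10E7F)
Arabic Mathematical Alphabetic Symbols (1EE00–1EEFF)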
If you wanted all of those, you'd just do this:
if ('\u0600' <= ch <= '\u06FF' or
        '\u0750' <= ch <= '\u077F' or
        '\u08A0' <= ch <= '\u08FF' or
        '\uFB50' <= ch <= '\uFDFF' or
        '\uFE70' <= ch <= '\uFEFF' or
        '\U00010E60' <= ch <= '\U00010E7F' or
        '\U0001EE00' <= ch <= '\U0001EEFF'):
    # ...
Of course I doubt you want all of those, but it should be obvious how to modify it to match the ones you do want.
And you're probably going to want some other characters that aren't Arabic or Persian: e.g., maybe add or ch.isspace(), or another range check, or a character class check.
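Putting the pieces together for the original question, a minimal sketch might look like the following. The names ARABIC_RANGES, is_arabic_char and has_non_arabic are made up for this example, and allowing whitespace by default is an assumption about what you want, not something the question specifies. It assumes Python 3, where a str is a sequence of code points.

```python
# Ranges taken from the block list above; trim this tuple to the blocks you want.
ARABIC_RANGES = (
    ('\u0600', '\u06FF'),          # Arabic
    ('\u0750', '\u077F'),          # Arabic Supplement
    ('\u08A0', '\u08FF'),          # Arabic Extended-A
    ('\uFB50', '\uFDFF'),          # Arabic Presentation Forms-A
    ('\uFE70', '\uFEFF'),          # Arabic Presentation Forms-B
    ('\U00010E60', '\U00010E7F'),  # Rumi Numeral Symbols
    ('\U0001EE00', '\U0001EEFF'),  # Arabic Mathematical Alphabetic Symbols
)

def is_arabic_char(ch):
    """True if the single character ch falls in one of ARABIC_RANGES."""
    return any(lo <= ch <= hi for lo, hi in ARABIC_RANGES)

def has_non_arabic(text, allow=str.isspace):
    """True if text has a character outside the ranges that `allow` also rejects.

    `allow` is a hypothetical hook for extra permitted characters;
    whitespace is permitted by default (an assumption).
    """
    return any(not (is_arabic_char(ch) or allow(ch)) for ch in text)

print(has_non_arabic('سلام دنیا'))
print(has_non_arabic('سلام world'))
```

Passing a different callable as allow (for example, one that also accepts ASCII digits or punctuation) is one way to widen the check without touching the ranges.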
And if you want to turn this into a regex, you should be able to figure out how to write a regex character class containing the ranges you want. (If not, you shouldn't be using regex.)
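For reference, one way such a character class could be written with the standard re module is sketched below. The names NON_ARABIC and has_non_arabic_re are invented for the example, and including \s in the class (so whitespace is tolerated) is an assumption. It assumes Python 3, where the \u and \U escapes in an ordinary string literal become the actual characters before the pattern is compiled.

```python
import re

# Negated class: matches any character that is NOT in the listed ranges
# and is not whitespace. Adjacent string literals are concatenated.
NON_ARABIC = re.compile(
    '[^'
    '\u0600-\u06FF'
    '\u0750-\u077F'
    '\u08A0-\u08FF'
    '\uFB50-\uFDFF'
    '\uFE70-\uFEFF'
    '\U00010E60-\U00010E7F'
    '\U0001EE00-\U0001EEFF'
    r'\s'   # assumption: whitespace is acceptable
    ']'
)

def has_non_arabic_re(text):
    """True if text contains at least one character matched by NON_ARABIC."""
    return NON_ARABIC.search(text) is not None

print(has_non_arabic_re('سلام دنیا'))
print(has_non_arabic_re('سلام world'))
```

Using search with a negated class stops at the first offending character, which also gives you its position via the match object if you need to report it.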