
Allow only (English & Arabic) in UTF-8 code

I am looking for a regex that replaces every character that is not English or Arabic with an underscore "_".

Currently I have the following code, which works, but I think I've got the wrong Unicode range, as it allows Chinese and other languages I don't need in my script.

$title=~tr/[a-z0-9_\x7f-\xff]/_/cd;

Any help would be appreciated

Tareq asked Feb 05 '26 06:02


2 Answers

If you're seeing bytes between \x7f and \xff, your application is probably working with UTF-8 bytes, not Unicode characters. Read perldoc perlunicode, then decode() your strings (decode() comes from the Encode module) before trying to work with them at this level.

Once that's done, you should be able to search for English and Arabic characters with something like:

/[\p{ASCII}\p{Arabic}]/

See perldoc perluniprops for other Unicode properties you can use.
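To make the decode-then-match advice concrete, here is a minimal sketch. It assumes the title arrives as UTF-8 encoded bytes and that the goal is the one stated in the question: replace every character outside the allowed set with "_". The helper name sanitize_title is hypothetical, and the character class is simply the negation of the one shown above.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: replaces every character that is neither ASCII nor
# Arabic script with "_". Input and output are UTF-8 encoded byte strings.
sub sanitize_title {
    my ($bytes) = @_;
    my $title = decode('UTF-8', $bytes);       # UTF-8 bytes -> Unicode characters
    $title =~ s/[^\p{ASCII}\p{Arabic}]/_/g;    # one "_" per rejected character
    return encode('UTF-8', $title);            # back to UTF-8 bytes for output
}

# "abc", a space, five Arabic letters, a space, two Chinese characters,
# written with escapes so the source file itself stays pure ASCII.
my $input = encode('UTF-8',
    "abc \x{645}\x{631}\x{62D}\x{628}\x{627} \x{4E2D}\x{6587}");
print sanitize_title($input), "\n";
```

Note that \p{ASCII} also keeps spaces and ASCII punctuation. If only letters, digits and underscore should survive, as in the question's original tr///, the class could be narrowed to something like [^A-Za-z0-9_\p{Arabic}].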

The range of the Arabic (Indic) digits is: \x{0660}-\x{0669}

The range of the Arabic letters is: \x{0621}-\x{063A}\x{0641}-\x{064A}

The range of the Arabic vowels including "Tatweel" is: \x{0640}\x{064B}-\x{0652}

The range of the Arabic punctuation is: \x{060C}\x{060D}\x{061B}-\x{061F}\x{2E2E}\x{066A}-\x{066D}
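The ranges listed above can be combined into a single negated character class. The following is only a sketch: it assumes the string has already been decoded into characters, and the ASCII part (A-Za-z0-9_) is an assumption based on the question's original tr///. The names $arabic, $not_allowed and sanitize_chars are hypothetical.

```perl
use strict;
use warnings;

# Single-quoted so the \x{...} escapes reach the regex compiler unchanged.
my $arabic = join '',
    '\x{0660}-\x{0669}',                    # Arabic-Indic digits
    '\x{0621}-\x{063A}\x{0641}-\x{064A}',   # letters
    '\x{0640}\x{064B}-\x{0652}',            # Tatweel and vowel marks
    '\x{060C}\x{060D}\x{061B}-\x{061F}\x{2E2E}\x{066A}-\x{066D}';  # punctuation

# Anything outside ASCII letters, digits, underscore and the ranges above.
my $not_allowed = qr/[^A-Za-z0-9_$arabic]/;

# Hypothetical helper: expects an already decoded character string.
sub sanitize_chars {
    my ($title) = @_;
    $title =~ s/$not_allowed/_/g;   # one "_" per rejected character
    return $title;
}

binmode(STDOUT, ':encoding(UTF-8)');    # encode characters on output
# "ab", a space, an Arabic letter, an Arabic-Indic digit, a Chinese character
print sanitize_chars("ab \x{645}\x{660}\x{4E2D}"), "\n";
```

Unlike \p{Arabic}, this explicit list leaves out anything not named in the ranges, so it is stricter but has to be maintained by hand.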

khaled.alshamaa answered Feb 06 '26 19:02


