
Regular Expressions in PHP: Matching to the UTS18 standard

Tags: regex, php, unicode

The Unicode Common Locale Data Repository (CLDR) has a wealth of information about the relationship between languages and characters. For example, you can determine which characters are used in a particular language by looking at the misc.exemplarCharacters chart. The raw data for these charts is stored as XML files, and the exemplar characters are stored as regular expressions that follow the Unicode Regular Expressions standard, UTS18.

Here are a few examples of what UTS18 regular expressions look like:

1. [a à b c ç d e é è f g h i í ï j k l ŀ m n o ó ò p q r s t u ú ü v w x y z]
2. [অ আ ই ঈ উ ঊ ঋ এ ঐ ও ঔ ং \u0981 ঃ ক খ গ ঘ ঙ চ ছ জ ঝ ঞ ট ঠ ড {ড\u09BC}ড় ঢ {ঢ\u09BC}ঢ় ণ ত থ দ ধ ন প ফ ব ভ ম য {য\u09BC} ৰ ল ৱ শ ষ স হ া ি ী \u09C1 \u09C2 \u09C3 ে ৈ ো ৌ \u09CD]
3. [a á b ɓ c d ɗ e é ɛ {ɛ\u0301} f g i í j k l m n {ny} ŋ o ó ɔ {ɔ\u0301} p r s t u ú ū w y]

I'm using PHP and SimpleXML to parse the XML data and isolate these regex strings. Now I would like to match individual multi-byte characters against these regular expressions. I'm currently using the mb_ereg_match function, which yields one or more of the following warnings (depending on the regex):

mbregex compile err: premature end of char-class in ...
mbregex compile err: empty range in char class in ...
mbregex compile err: empty char-class in ...

Any ideas as to why this isn't working?

David Jones asked Nov 04 '22 21:11



1 Answer

As suggested by Sergey, I added the following lines before calling the mb_ereg_match() function:

mb_internal_encoding('UTF-8');
mb_regex_encoding('UTF-8');

This addition eliminated two of the warnings listed above. I was only left with the following warning:

mbregex compile err: empty char-class in ...

After some additional debugging, I discovered that a handful of the CLDR XML files do in fact contain empty exemplar sets, that is, a character class with nothing inside it. For example, in kn.xml we have the following line:

<exemplarCharacters type="auxiliary">[]</exemplarCharacters>

I believe these lines are erroneous, as the expected behavior would be to simply leave the element out altogether (which is mostly the case throughout the CLDR).

Thus, I was able to eliminate this last error by simply throwing out empty regex strings.
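For anyone who wants to see the two fixes together, here is a minimal sketch of roughly what this could look like. It assumes the mbstring extension is built with mbregex support. The inline XML is a small made-up stand-in for a real CLDR file, load_exemplar_sets() is a hypothetical helper name, and the 'main' key is just a label I chose for the set that has no type attribute.

```php
<?php
// Sketch only: assumes mbstring with mbregex support is available.
// Set both encodings before any mb_ereg_* call.
mb_internal_encoding('UTF-8');
mb_regex_encoding('UTF-8');

// Hypothetical inline sample standing in for a CLDR locale file.
$xml = <<<'XML'
<ldml>
  <characters>
    <exemplarCharacters>[a à b c ç d e é è]</exemplarCharacters>
    <exemplarCharacters type="auxiliary">[]</exemplarCharacters>
  </characters>
</ldml>
XML;

/**
 * Hypothetical helper: collect the non-empty exemplar sets, keyed by type.
 * The set without a type attribute is stored under the label 'main'.
 */
function load_exemplar_sets(string $xml): array
{
    $sets = [];
    $doc = simplexml_load_string($xml);
    if ($doc === false) {
        return $sets; // XML could not be parsed
    }
    foreach ($doc->xpath('//exemplarCharacters') as $node) {
        $pattern = trim((string) $node);
        // Throw out empty sets such as "[]", which cause "empty char-class".
        if ($pattern === '' || preg_match('/^\[\s*\]$/', $pattern) === 1) {
            continue;
        }
        $type = isset($node['type']) ? (string) $node['type'] : 'main';
        $sets[$type] = $pattern;
    }
    return $sets;
}

$sets = load_exemplar_sets($xml);

// mb_ereg_match() tests the pattern against the start of the string.
$hasCedilla = mb_ereg_match($sets['main'], 'ç');
$hasEnye = mb_ereg_match($sets['main'], 'ñ');
```

With this sample, only the main set survives the filter, since the auxiliary set is empty. One caveat: as far as I can tell, mbregex treats the spaces inside the brackets as literal members of the class, so a space would also match unless you strip them first.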

Hope this helps someone else!

David Jones answered Nov 09 '22 17:11
