
How Can I Run a Regex that Tests Text for Characters in a Particular Alphabet or Script?

Tags: regex, perl

I'd like to make a regex in Perl that will test a string for characters in a particular script. This would be something like:

$text =~ .*P{'Chinese'}.*

Is there a simple way of doing this? For English it's easy: just test for [a-zA-Z]. But for a script like Chinese, or one of the Japanese scripts, I can't figure out any way to do it short of writing out every character explicitly, which would make for some very ugly code. Ideas? I can't be the first or only person who has wanted to do this.

Eli asked Nov 30 '11 22:11


1 Answer

Look at perldoc perluniprops, which provides an exhaustive list of properties you can use with \p. You'll be interested in \p{CJK_Unified_Ideographs} and related block properties such as \p{CJK_Symbols_And_Punctuation}. \p{Hiragana} and \p{Katakana} give you the kana. There is also a \p{Script=...} property for a number of scripts: \p{Han} and \p{Script=Han} match Han characters (used for Chinese, and also for Japanese kanji), but there is no corresponding \p{Script=Japanese}, quite simply because Japanese is written with multiple scripts.

Jon Purdy answered Sep 23 '22 01:09