
How do I match only fully-composed characters in a Unicode string in Perl?

I'm looking for a way to match only fully composed characters in a Unicode string.

Is [:print:] dependent upon locale in any regular expression implementation that incorporates this character class? For example, will it match the Japanese character 'あ', since it is not a control character, or will [:print:] always cover only ASCII codes 0x20 to 0x7E?

Is there any character class, including in Perl regular expressions, that can be used to match anything other than a control character? If [:print:] includes only characters in the ASCII range, I would assume [:cntrl:] does too.

dreamlax asked Oct 15 '08 03:10


2 Answers

echo あ| perl -nle 'BEGIN{binmode STDIN,":utf8"} print"[$_]"; print /[[:print:]]/ ? "YES" : "NO"'

This mostly works, though it generates a "Wide character in print" warning because STDOUT has no encoding layer; adding binmode STDOUT, ":utf8" to the BEGIN block silences it. But it gives you the idea: you must be sure you're dealing with a real Unicode string (check utf8::is_utf8). Or just read perlunicode; the whole subject still makes my head spin.
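For what it's worth, here is a rough script version of the same idea. It is only a sketch: the in-memory filehandle `$in` and the `@results` array are stand-ins I made up so the example runs without piping anything in, and the input is written as raw byte escapes so the file itself stays ASCII.

```perl
use strict;
use warnings;

# Raw UTF-8 bytes for U+3042 (HIRAGANA LETTER A) plus a newline, standing in
# for what `echo` would send on STDIN. Escapes keep this source file ASCII.
my $input = "\xE3\x81\x82\n";

# Read the bytes through a decoding layer, much as binmode(STDIN, ...) would.
open(my $in, '<:encoding(UTF-8)', \$input)
    or die "cannot open in-memory handle: $!";

# Give STDOUT an encoding layer too, so printing the character does not warn.
binmode(STDOUT, ':encoding(UTF-8)');

my @results;    # hypothetical collector, only here so the verdicts can be inspected
while (my $line = <$in>) {
    chomp $line;
    my $verdict = $line =~ /[[:print:]]/ ? 'YES' : 'NO';
    push @results, $verdict;
    print "[$line] $verdict\n";
}
close $in;
```

The point is that the match runs against a decoded one-character string, not against the three raw bytes.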

Tanktalus answered Oct 03 '22 11:10


I think you don't want or need locales for that, but rather Unicode. Once you have decoded a text string, \w will match word characters in any language, \d matches not just 0..9 but every Unicode digit, and so on. In regexes you can query Unicode properties with \p{PropertyName}. Particularly interesting for you might be \p{Print}. Here's a list of all the available Unicode character properties.
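A small sketch of how these classes behave on already-decoded characters. The `classify` helper and the `%samples` table are names invented for this illustration, and the characters are given as `\x{...}` escapes so nothing depends on the source file's encoding.

```perl
use strict;
use warnings;

# Sample characters, keyed by their Unicode names.
my %samples = (
    'HIRAGANA LETTER A'        => "\x{3042}",
    'ARABIC-INDIC DIGIT THREE' => "\x{0663}",
    'CHARACTER TABULATION'     => "\x{0009}",
    'LATIN SMALL LETTER A'     => "a",
);

# Report which of the classes a single character belongs to.
sub classify {
    my ($ch) = @_;
    return {
        print => ($ch =~ /\A\p{Print}\z/ ? 1 : 0),
        cntrl => ($ch =~ /\A\p{Cntrl}\z/ ? 1 : 0),
        word  => ($ch =~ /\A\w\z/        ? 1 : 0),
        digit => ($ch =~ /\A\d\z/        ? 1 : 0),
    };
}

for my $name (sort keys %samples) {
    my $r = classify($samples{$name});
    printf "%-26s print=%d cntrl=%d word=%d digit=%d\n",
        $name, @{$r}{qw(print cntrl word digit)};
}
```

To match anything other than a control character, the negated property \P{Cntrl} should do the job.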

I wrote an article about the basics and subtleties of Unicode and Perl; it should give you a good idea of what to do so that perl will recognize your string as a sequence of characters, not just a sequence of bytes.
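To make the bytes-versus-characters distinction concrete, here is a minimal sketch using the core Encode module. The `describe` helper is hypothetical, and note that utf8::is_utf8 only reports perl's internal flag, so treat it as a debugging aid rather than a correctness check.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The UTF-8 encoding of U+3042, as three raw bytes.
my $bytes = "\xE3\x81\x82";

# After decoding, the same data is a single character.
my $chars = decode('UTF-8', $bytes);

# Summarise how perl sees a string: its length, the internal UTF-8 flag,
# and whether it is exactly one printable character.
sub describe {
    my ($label, $str) = @_;
    return sprintf "%s: length=%d, utf8 flag=%s, single printable char=%s",
        $label, length($str),
        (utf8::is_utf8($str) ? 'on' : 'off'),
        ($str =~ /\A[[:print:]]\z/ ? 'yes' : 'no');
}

print describe('bytes', $bytes), "\n";
print describe('chars', $chars), "\n";
```

Only the decoded string is seen as one printable character; the undecoded one is three separate bytes as far as the regex engine is concerned.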

Update: with Unicode you don't get language-dependent behaviour, but instead sane defaults regardless of language. This may or may not be what you want, but for distinguishing printable from control characters I don't see why you'd need language-dependent behaviour.

moritz answered Oct 03 '22 10:10