Does \w match all alphanumeric characters defined in the Unicode standard?

Does Perl's \w match all alphanumeric characters defined in the Unicode standard?

For example, will \w match all (say) Chinese and Russian alphanumeric characters?

I wrote a simple test script (see below) which suggests that \w does indeed match "as expected" for the non-ASCII alphanumeric characters I tested. But the testing is obviously far from exhaustive.

#!/usr/bin/perl

use strict;
use warnings;
use utf8;    # the string literals below are written in UTF-8

binmode(STDOUT, ':utf8');

my @ok;
$ok[0] = "abcdefghijklmnopqrstuvwxyz";
$ok[1] = "éèëáàåäöčśžłíżńęøáýąóæšćôı";
$ok[2] = "şźüęłâi̇ółńśłŕíáυσνχατςęςη";
$ok[3] = "τσιαιγολοχβςανنيرحبالтераб";
$ok[4] = "иневоаслкłјиневоцедањеволс";
$ok[5] = "рглсывызтоμςόκιναςόγο";

foreach my $ok (@ok) {
    die unless ($ok =~ /^\w+$/);
}

286

asked Apr 05 '11 17:04

knorv

2 Answers

perldoc perlunicode says

Character classes in regular expressions match characters instead of bytes and match against the character properties specified in the Unicode properties database. \w can be used to match a Japanese ideograph, for instance.

So it looks like the answer to your question is "yes".

However, you might want to use the \p{} construct to directly access specific Unicode character properties. You can probably use \p{L} (or, shorter, \pL) for letters and \pN for numbers and feel a little more confident that you'll get exactly what you want.
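To see how \w, \p{L} and \p{N} differ in practice, here is a minimal sketch. The helper name classes_for and the sample code points are illustrative choices, not part of the original answer, and the /u flag assumes Perl 5.14 or later.

```perl
use strict;
use warnings;

# Hypothetical helper: list which of \w, \p{L}, \p{N} match one character.
# The /u flag (Perl 5.14+) forces Unicode rules even for code points below 256.
sub classes_for {
    my ($ch) = @_;
    my @hits;
    push @hits, 'w' if $ch =~ /\A\w\z/u;
    push @hits, 'L' if $ch =~ /\A\p{L}\z/u;
    push @hits, 'N' if $ch =~ /\A\p{N}\z/u;
    return join ',', @hits;
}

# Cyrillic letter, CJK ideograph, Arabic-Indic digit, underscore,
# combining acute accent, vulgar fraction one half.
for my $cp (0x434, 0x4E2D, 0x663, 0x5F, 0x301, 0xBD) {
    printf "U+%04X: %s\n", $cp, classes_for(chr $cp);
}
```

The underscore and the combining accent match \w but neither \p{L} nor \p{N}, while the fraction matches \p{N} but not \w, so [\p{L}\p{N}] and \w are not interchangeable.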

179

answered Oct 11 '22 08:10

CanSpice

Yes and no.

If you want all alphanumerics, you want [\p{Alphabetic}\p{GC=Number}]. The \w class contains both more and less than that. It specifically excludes any \pN that is neither \p{Nd} nor \p{Nl}, such as the superscripts, subscripts, and fractions. Those are \p{GC=Other_Number}, and are not included in \w.
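The "both more and less" point can be checked on a few characters. This is a small sketch; compare_classes is a hypothetical helper and the sample characters are my own picks.

```perl
use strict;
use warnings;

# Hypothetical helper: does a single character match \w, and does it
# match the "all alphanumerics" class suggested above?
sub compare_classes {
    my ($ch) = @_;
    my $alnum_re = qr/[\p{Alphabetic}\p{GC=Number}]/u;
    my $w     = $ch =~ /\A\w\z/u        ? 1 : 0;
    my $alnum = $ch =~ /\A$alnum_re\z/u ? 1 : 0;
    return "w=$w alnum=$alnum";
}

# Superscript two, subscript two, one half, underscore,
# combining acute accent, Roman numeral eight.
for my $cp (0xB2, 0x2082, 0xBD, 0x5F, 0x301, 0x2167) {
    printf "U+%04X: %s\n", $cp, compare_classes(chr $cp);
}
```

The superscript, subscript, and fraction are alphanumeric by this definition but are not matched by \w. The underscore and the combining accent go the other way.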

Unlike most regex systems, Perl complies with Requirement 1.2a, “Compatibility Properties”, from UTS #18 on Unicode Regular Expressions. So, assuming you have Unicode strings, a \w in a regex matches any single code point that has any of the following four properties:

1. \p{Alphabetic}
2. \p{GC=Mark}
3. \p{GC=Connector_Punctuation}
4. \p{GC=Decimal_Number}
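As a rough check, the union of those four properties can be written out as a bracketed class and compared with \w on sample characters. Alphabetic is a binary property rather than a General_Category value, so the sketch spells it \p{Alphabetic}. The helper agrees_with_w is hypothetical, and the check covers only the listed samples. Newer Perl versions may add a few more code points to \w, for example the join controls, so a full scan could show small differences.

```perl
use strict;
use warnings;

# Hypothetical helper: true when \w and the hand-written union of the four
# properties give the same answer for one character.
sub agrees_with_w {
    my ($ch) = @_;
    my $union = qr/[\p{Alphabetic}\p{GC=Mark}\p{GC=Connector_Punctuation}\p{GC=Decimal_Number}]/u;
    my $by_w     = $ch =~ /\A\w\z/u     ? 1 : 0;
    my $by_union = $ch =~ /\A$union\z/u ? 1 : 0;
    return $by_w == $by_union;
}

# Letters, a mark, two connector punctuation characters, a non-ASCII digit,
# and a few characters that should be rejected by both.
my @samples = (0x61, 0xE9, 0x4E2D, 0x301, 0x5F, 0x203F, 0x663, 0xB2, 0x2D, 0x20);
for my $cp (@samples) {
    printf "U+%04X: %s\n", $cp, agrees_with_w(chr $cp) ? 'same' : 'DIFFERENT';
}
```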

Number 4 above can be expressed in any of these ways, which are all considered equivalent:

\p{Digit}
\p{General_Category=Decimal_Number}
\p{GC=Decimal_Number}
\p{Decimal_Number}
\p{Nd}
\p{Numeric_Type=Decimal}
\p{Nt=De}

Note that \p{Digit} is not the same as \p{Numeric_Type=Digit}. For example, code point U+00B2, SUPERSCRIPT TWO, has only the \p{Numeric_Type=Digit} property and not plain \p{Digit}. That is because it is considered a \p{Other_Number} or \p{No}. It does, however, have the \p{Numeric_Value=2} property, as you would imagine.
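The distinction is easy to probe directly. In this sketch, digit_profile is a hypothetical helper that reports which of a handful of number-related properties a character has. The property spellings are the ones used in this answer.

```perl
use strict;
use warnings;

# Hypothetical helper: names of the listed properties that one character has.
sub digit_profile {
    my ($ch) = @_;
    my @checks = (
        [ 'Digit'                => qr/\A\p{Digit}\z/u ],
        [ 'Nd'                   => qr/\A\p{Nd}\z/u ],
        [ 'Numeric_Type=Decimal' => qr/\A\p{Numeric_Type=Decimal}\z/u ],
        [ 'Numeric_Type=Digit'   => qr/\A\p{Numeric_Type=Digit}\z/u ],
        [ 'No'                   => qr/\A\p{No}\z/u ],
        [ 'Numeric_Value=2'      => qr/\A\p{Numeric_Value=2}\z/u ],
    );
    return join ',', map { $_->[0] } grep { $ch =~ $_->[1] } @checks;
}

# ASCII 2, ARABIC-INDIC DIGIT TWO, SUPERSCRIPT TWO.
for my $cp (0x32, 0x662, 0xB2) {
    printf "U+%04X: %s\n", $cp, digit_profile(chr $cp);
}
```

The two decimal digits report Digit, Nd, and Numeric_Type=Decimal together, while SUPERSCRIPT TWO reports only Numeric_Type=Digit, No, and Numeric_Value=2.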

It’s really point number 1 above, \p{Alphabetic}, that gives people the most trouble. That’s because they too often mistakenly think it is somehow the same as \p{Letter} (\pL), but it is not.

Alphabetics include much more than that. The \p{Other_Alphabetic} property contributes some but not all \p{GC=Mark}, and \p{Alphabetic} also takes in all of \p{Lowercase} (which is not the same as \p{GC=Ll} because it adds \p{Other_Lowercase}) and all of \p{Uppercase} (which is not the same as \p{GC=Lu} because it adds \p{Other_Uppercase}).

That’s how it pulls in \p{GC=Letter_Number} like Roman numerals and also all the circled letters, which are of type \p{Other_Symbol} and \p{Block=Enclosed_Alphanumerics}.
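A few characters make the gap between \p{Alphabetic} and \p{Letter} concrete. This is only a sketch: alpha_vs_letter is a hypothetical helper and the sample code points are chosen for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: compare Alphabetic, Letter and \w for one character.
sub alpha_vs_letter {
    my ($ch) = @_;
    my $alpha  = $ch =~ /\A\p{Alphabetic}\z/u ? 1 : 0;
    my $letter = $ch =~ /\A\p{Letter}\z/u     ? 1 : 0;
    my $word   = $ch =~ /\A\w\z/u             ? 1 : 0;
    return "Alphabetic=$alpha Letter=$letter w=$word";
}

# ROMAN NUMERAL EIGHT (Nl), CIRCLED LATIN CAPITAL LETTER A (So),
# COMBINING GREEK YPOGEGRAMMENI (Mn), CIRCLED DIGIT ONE (No), ASCII a.
for my $cp (0x2167, 0x24B6, 0x345, 0x2460, 0x61) {
    printf "U+%04X: %s\n", $cp, alpha_vs_letter(chr $cp);
}
```

The Roman numeral, the circled letter, and the combining mark are all Alphabetic (and so match \w) without being letters. The circled digit is neither, and \w rejects it.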

Aren’t you glad we get to use \w? :)

answered Oct 11 '22 06:10

tchrist
