Perl does not seem to recognize the accented Ó as uppercase:
#!/usr/bin/env perl
use strict;
use warnings;
use 5.14.0;
use utf8;
use feature 'unicode_strings';
" SIMÓN " =~ /^\s+(\p{Upper}+)/u;
print "$1\n";
returns
SIM
Perl should be able to use Unicode data, which already tags Ó as uppercase.
From Emacs describe-char:
character code properties: customize what to show
name: LATIN CAPITAL LETTER O WITH ACUTE
old-name: LATIN CAPITAL LETTER O ACUTE
general-category: Lu (Letter, Uppercase)
decomposition: (79 769) ('O' '́')
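The same properties can be checked from Perl itself. The following is a minimal sketch, assuming the core module Unicode::UCD is installed; describe_char and is_upper are hypothetical helper names used only for illustration. It looks the character up by code point, so the result does not depend on how the script file is encoded.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Unicode::UCD qw(charinfo);

# Hypothetical helper: return "NAME: CATEGORY" for a code point,
# using Perl's own copy of the Unicode character database.
sub describe_char {
    my ($code_point) = @_;
    my $info = charinfo($code_point);
    return sprintf '%s: %s', $info->{name}, $info->{category};
}

# Hypothetical helper: true (1) if the single character at this
# code point matches \p{Upper}, otherwise 0.
sub is_upper {
    my ($code_point) = @_;
    return chr($code_point) =~ /\A\p{Upper}\z/ ? 1 : 0;
}

# U+00D3 is Ó; the output here is plain ASCII, so no output
# encoding layer is needed for this check.
print describe_char(0xD3), "\n";
print is_upper(0xD3) ? "uppercase\n" : "not uppercase\n";
```

If this reports Lu and "uppercase", Perl's Unicode data agrees with Emacs, and the problem is more likely in how the script or its output is encoded than in \p{Upper}.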
The regex can match the first expression or the second. The two cases here are "is this the first word of the line" or "everything else," because we have the special requirement of excluding one-letter words at the beginning of the line. Now, let's look at each expression in the alternation.
We still need to match words consisting entirely of digits and uppercase letters. That is handled by a relatively small portion of the second expression in the alternation: \b[A-Z0-9]+\b. Each \b represents a word boundary, and [A-Z0-9]+ matches one or more digits and ASCII capital letters together.
This means that you have to list every range of UTF-16 code units that corresponds to a character you want to match. A quick and dirty solution might be [a-zA-Z\u0080-\uFFFF]. This matches any letter in the ASCII range, but it also matches any character at all that is outside the ASCII range.
You need \p{L} to match any letter if you want to include Unicode. In Unicode terms, the equivalent of \w is then [\p{L}\p{N}_]. Update: As of ES2018, JavaScript supports Unicode property escapes such as \p{L} (with the u flag), which matches anything that Unicode considers to be a letter.
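In Perl, the language of the question, the contrast between an ASCII-only class and Unicode properties can be sketched as follows. The helper names is_ascii_caps and is_unicode_caps are hypothetical, and the choice of \p{Lu} plus \p{Nd} as the Unicode counterpart of [A-Z0-9] is an assumption made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: whole string is ASCII capitals and ASCII digits only.
sub is_ascii_caps {
    my ($word) = @_;
    return $word =~ /\A[A-Z0-9]+\z/ ? 1 : 0;
}

# Hypothetical helper: whole string is Unicode uppercase letters
# (\p{Lu}) and decimal digits (\p{Nd}) only.
sub is_unicode_caps {
    my ($word) = @_;
    return $word =~ /\A[\p{Lu}\p{Nd}]+\z/ ? 1 : 0;
}

# \N{U+00D3} spells the accented capital O by code point.
my $word = "SIM\N{U+00D3}N";
printf "ASCII class:   %d\n", is_ascii_caps($word);
printf "Unicode class: %d\n", is_unicode_caps($word);
```

The ASCII class rejects the word because of the accented letter, while the property-based class accepts it.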
You're missing use open ':std', ':locale';
to properly encode your output.
If that doesn't work, your file isn't encoded using UTF-8 even though you tell Perl it is.
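Putting that together, a minimal sketch of the script with the output layer added might look like the following. The helper name leading_upper is hypothetical, and writing the letter as \N{U+00D3} instead of a literal Ó is a choice made here so that the test string is correct regardless of how the file was saved.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use 5.14.0;
use utf8;                      # string literals in this file are UTF-8
use open ':std', ':locale';    # encode STDIN/STDOUT/STDERR per the current locale

# Hypothetical helper: return the run of uppercase letters that follows
# leading whitespace, or undef if there is no such run.
sub leading_upper {
    my ($text) = @_;
    return $text =~ /^\s+(\p{Upper}+)/ ? $1 : undef;
}

# The accented capital O is written by code point, so this line does not
# depend on the encoding of the source file.
say leading_upper(" SIM\N{U+00D3}N ");
```

If this version prints the full word but the original with a literal Ó does not, that points to the source file not actually being saved as UTF-8.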