
Regex for uppercase Unicode does not match "Ó"?

It seems that Perl does not recognize the accented Ó as uppercase:

#!/usr/bin/env perl
use strict;
use warnings;
use 5.14.0;
use utf8;
use feature 'unicode_strings';

" SIMÓN " =~ /^\s+(\p{Upper}+)/u;
print "$1\n";

prints

SIM

Perl should be able to use the Unicode data, which already tags Ó as uppercase. From Emacs describe-char:

character code properties: customize what to show
  name: LATIN CAPITAL LETTER O WITH ACUTE
  old-name: LATIN CAPITAL LETTER O ACUTE
  general-category: Lu (Letter, Uppercase)
  decomposition: (79 769) ('O' '́')
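
The same properties can be cross-checked from Perl itself. This is a minimal sketch, assuming the core Unicode::UCD module is available. It uses the escape \x{D3} rather than a literal Ó, so the result does not depend on how the script file is encoded.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use v5.14;                      # enables the unicode_strings feature
use Unicode::UCD qw(charinfo);

# Look up U+00D3 in the Unicode database that ships with Perl.
my $info = charinfo(0xD3);

# These fields are plain ASCII, so no output encoding layer is needed.
print "name:     $info->{name}\n";
print "category: $info->{category}\n";

# Test the property directly against the code point.
my $is_upper = "\x{D3}" =~ /\p{Upper}/ ? 1 : 0;
print "matches \\p{Upper}: $is_upper\n";
```

If this reports Lu and a successful match, the regex engine knows about Ó, and the problem is more likely in how the source or the output is encoded.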
asked Jun 05 '12 04:06 by user525602




1 Answer

You're missing use open ':std', ':locale'; to properly encode your output.
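
As a sketch of that change, here is the script from the question with the pragma added. The helper name leading_upper is hypothetical and only there to make the match easy to reuse. It assumes the file is saved as UTF-8 and that the terminal runs in a UTF-8 locale.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use v5.14;                    # also enables the unicode_strings feature
use utf8;                     # assumption: this source file is saved as UTF-8
use open ':std', ':locale';   # encode STDOUT/STDERR according to the locale

# Hypothetical helper: return the run of uppercase characters that follows
# the leading whitespace, or undef if there is no match.
sub leading_upper {
    my ($text) = @_;
    return $text =~ /^\s+(\p{Upper}+)/ ? $1 : undef;
}

my $word = leading_upper(" SIMÓN ");
print defined $word ? "$word\n" : "no match\n";
```

In a UTF-8 locale this should print the whole word, accented letter included. The original script printed $1 without checking that the match succeeded, so the helper returns undef on failure instead.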

If that doesn't work, your file isn't encoded using UTF-8 even though you tell Perl it is.
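
One way to test that second possibility is to read the script as raw bytes and try a strict UTF-8 decode. The following is a rough sketch, not a definitive check. The helper name is_valid_utf8 is hypothetical, and the script expects the file paths to inspect as command-line arguments.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# Hypothetical helper: true if the byte string is well-formed UTF-8.
# decode() croaks on malformed input when given FB_CROAK; $bytes is a
# private copy, so in-place modification by decode() is harmless.
sub is_valid_utf8 {
    my ($bytes) = @_;
    return eval { decode('UTF-8', $bytes, FB_CROAK); 1 } ? 1 : 0;
}

# Check every file named on the command line.
for my $path (@ARGV) {
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my $bytes = do { local $/; <$fh> };
    close $fh;
    my $verdict = is_valid_utf8($bytes) ? "valid UTF-8" : "NOT valid UTF-8";
    print "$path: $verdict\n";
}
```

A file saved as Latin-1 stores Ó as the single byte 0xD3, which is not well-formed UTF-8 when followed by an ASCII letter, so it would be reported as invalid.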

answered Nov 10 '22 21:11 by ikegami
