Why doesn't "\w" match Unicode word characters (for example, "ğ,İ,ş,ç,ö,ü") in a Perl regular expression?
I tried to match these characters with the regular expression m{\w+}g. However, it does not match "ğ,İ,ş,ç,ö,ü".
How can I make this work?
use strict;
use warnings;
use v5.12;
use utf8;
open(MYINPUTFILE, "< $ARGV[0]");
my @strings;
my $delimiter;
my $extensions;
my $id;
while(<MYINPUTFILE>)
{
my($line) = $_;
chomp($line);
print $line."\n";
unshift(@strings,$line =~ /\w+/g);
$delimiter = /[._\s]/;
$extensions = /pdf$|doc$|docx$/;
$id = /^200|^201/;
}
foreach(@strings){
print $_."\n";
}
The input file is like:
Çidem_Şener
Hüsnü Tağlip
...
The output goes like:
H�
sn�
Ta�
lip
�
idem_�
ener
In the code, I try to read the file and collect each string into the array. (The delimiter can be _ or . or \s.)
Make sure that Perl is treating the data as UTF-8.
For example, if the data is embedded in the script itself:
#!/usr/bin/perl
use strict;
use warnings;
use v5.12;
use utf8; # States that the Perl program itself is saved using utf8 encoding
say "matched" if "ğİşçöü" =~ /^\w+$/;
That outputs matched. If I remove the use utf8; line, it does not.
\w matches any of ğ, İ, ş, ç, ö, ü just fine.
'ğİşçöü' =~ /\A \w+ \z/msx; # true
You probably forgot to decode the input from octets into Perl characters. I suspect your regex is examining the data at the byte level rather than at the character level you expect.
Read http://p3rl.org/UNI and http://training.perl.com/scripts/perlunicook.html to learn about the topic of encoding in Perl.
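To illustrate the byte-versus-character point, here is a minimal sketch. It assumes the input is UTF-8 and hard-codes the octets for "Şener" instead of reading a file; the variable names are made up for the example.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The UTF-8 octets for "Şener", as an undecoded filehandle would deliver them.
# Ş (U+015E) is the two bytes C5 9E.
my $octets = "\xC5\x9Eener";

# Matching on octets: \x9E never counts as a word character, so the name
# cannot come out as one match. (Whether \xC5 counts depends on the Perl
# version and enabled features.)
my @from_octets = $octets =~ /\w+/g;

# Matching on characters: decode first, then Ş is a single word character.
my $chars      = decode('UTF-8', $octets);
my @from_chars = $chars =~ /\w+/g;

printf "octets:     %d match(es), lengths: %s\n",
    scalar @from_octets, join(',', map { length } @from_octets);
printf "characters: %d match(es), lengths: %s\n",
    scalar @from_chars, join(',', map { length } @from_chars);
```

Only lengths are printed here, so the sketch does not depend on how STDOUT is encoded.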
Edit:
The problem is likely here (I cannot tell for sure without the content of the file):
open(MYINPUTFILE, "< $ARGV[0]");
Find out the encoding of the file; perhaps it's UTF-8 or Windows-1254. Then rewrite the open call, e.g.:
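open my $in, '<:encoding(UTF-8)', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
# or, if the file turns out to be Windows-1254:
open my $in, '<:encoding(Windows-1254)', $ARGV[0] or die "Cannot open $ARGV[0]: $!";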
Printing characters to STDOUT (near the end of your program) is broken in the same way, because no output encoding is set. ℞ 16: Declare STD{IN,OUT,ERR} to be in locale encoding shows one way to do it properly.
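Putting both fixes together (decode on input, encode on output), a corrected version of the loop might look like the sketch below. It assumes the file is UTF-8 and that the terminal expects UTF-8. The helper words_from_file is a made-up name, and the temporary file only stands in for the real input so the example runs on its own.

```perl
use strict;
use warnings;
use utf8;                       # the string literals below are UTF-8
use File::Temp qw(tempfile);

# Read a file, decoding its octets with the given encoding,
# and return every run of word characters.
sub words_from_file {
    my ($path, $encoding) = @_;
    open my $in, "<:encoding($encoding)", $path
        or die "Cannot open $path: $!";
    my @words;
    while (my $line = <$in>) {
        chomp $line;
        push @words, $line =~ /\w+/g;
    }
    close $in;
    return @words;
}

# Demo input: the sample lines from the question, written as UTF-8.
my ($fh, $path) = tempfile(UNLINK => 1);
binmode $fh, ':encoding(UTF-8)';
print {$fh} "Çidem_Şener\nHüsnü Tağlip\n";
close $fh;

# Encode characters on the way out (assumption: a UTF-8 terminal).
binmode STDOUT, ':encoding(UTF-8)';

my @words = words_from_file($path, 'UTF-8');
print "$_\n" for @words;
```

Note that \w also matches the underscore, so "Çidem_Şener" stays a single token; to split on _ as well, something like split /[._\s]+/, $line could replace the \w+ match.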