
Why doesn't "\w" match Unicode word characters (for example, "ğ,İ,ş,ç,ö,ü") in a Perl regular expression?

I tried to match these characters with the regular expression m{\w+}g, but it does not match "ğ,İ,ş,ç,ö,ü".

How can I make this work?

use strict;
use warnings;
use v5.12;
use utf8;

open(MYINPUTFILE, "< $ARGV[0]");

my @strings;
my $delimiter;
my $extensions;
my $id;

while(<MYINPUTFILE>)
{
    my($line) = $_;
    chomp($line);
    print $line."\n";
    unshift(@strings,$line =~ /\w+/g);
    $delimiter = /[._\s]/;
    $extensions = /pdf$|doc$|docx$/;
    $id = /^200|^201/;
}

foreach(@strings){
    print $_."\n";
}

The input file is like:

Çidem_Şener
Hüsnü Tağlip
...

The output goes like:

H�

sn�

Ta�

lip

�

idem_�

ener

In the code, I try to read the file and store each string in the array. (The delimiter can be _, . or \s.)

erogol asked Mar 15 '12 17:03


2 Answers

Make sure that Perl is treating the data as UTF-8.

For example, if the data is embedded in the script itself:

#!/usr/bin/perl

use strict;
use warnings; 
use v5.12;
use utf8;   # States that the Perl program itself is saved using utf8 encoding

say "matched" if "ğİşçöü" =~ /^\w+$/;

That outputs matched. If I remove the use utf8; line, it does not.

Quentin answered Nov 15 '22 05:11


\w matches any of ğ İ ş ç ö ü just fine.

'ğİşçöü' =~ /\A \w+ \z/msx;     # true

You probably forgot to decode the input from octets into Perl characters. I suspect your regex is examining the data at the byte level instead of the character level, which is what one would expect it to do.
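A minimal sketch of that byte-versus-character difference, assuming the input is UTF-8 (the question does not confirm this). The helper `words` and the sample octets are illustrative and not part of the original program. Note that `use utf8` only affects literals in the source file, so it does nothing for data read from a file.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: return every \w+ run found in a string.
sub words {
    my ($s) = @_;
    return $s =~ /\w+/g;
}

# The UTF-8 octets of "Hüsnü", written as escapes so this script's
# own encoding does not matter.
my $octets = "H\xC3\xBCsn\xC3\xBC";

# Undecoded: the regex sees seven separate bytes, and the word is split.
my @from_bytes = words($octets);

# Decoded: the regex sees five characters, and the word stays whole.
my @from_chars = words(decode('UTF-8', $octets));

printf "bytes: %d, chars: %d\n", scalar @from_bytes, scalar @from_chars;
# prints "bytes: 2, chars: 1"
```

Exactly where the undecoded string is split depends on which regex semantics are in effect, so the sketch counts the pieces instead of printing them.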

Read http://p3rl.org/UNI and http://training.perl.com/scripts/perlunicook.html to learn about the topic of encoding in Perl.


Edit:

The problem is likely here (I cannot tell for sure without the content of the file):

open(MYINPUTFILE, "< $ARGV[0]");

Find out the encoding of the file (perhaps it's UTF-8 or Windows-1254), then rewrite the open call to declare it, e.g.:

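# If the file is UTF-8:
open my $in, '<:encoding(UTF-8)', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
# If it is Windows-1254 instead:
# open my $in, '<:encoding(Windows-1254)', $ARGV[0] or die "Cannot open $ARGV[0]: $!";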

Printing characters to STDOUT (near the end of your program) is broken in the same way, because no output encoding is declared. ℞ 16 in perlunicook, "Declare STD{IN,OUT,ERR} to be in locale encoding", shows one way to do it properly.
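Putting both halves together, here is a hedged sketch that decodes on input and encodes on output. It assumes the file and the terminal both use UTF-8, which the question does not confirm. The `read_words` helper and the temporary demo file are hypothetical, and `binmode` is used here in place of the locale-based approach from the recipe.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: read a file in the given encoding and return
# every \w+ run found in it.
sub read_words {
    my ($path, $encoding) = @_;
    open my $in, "<:encoding($encoding)", $path
        or die "Cannot open $path: $!";
    my @words;
    while (my $line = <$in>) {
        chomp $line;
        push @words, $line =~ /\w+/g;
    }
    close $in;
    return @words;
}

# Encode characters on the way out (assumes a UTF-8 terminal).
binmode STDOUT, ':encoding(UTF-8)';

# Demo input: a temporary file holding the UTF-8 octets of "Hüsnü Tağlip".
my ($fh, $path) = tempfile(UNLINK => 1);
binmode $fh;
print {$fh} "H\xC3\xBCsn\xC3\xBC Ta\xC4\x9Flip\n";
close $fh;

# Each word is printed whole, one per line.
print "$_\n" for read_words($path, 'UTF-8');
```

For a Windows-1254 file, pass 'Windows-1254' as the second argument instead.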

daxim answered Nov 15 '22 03:11