Why doesn't "\w" match Unicode word characters (for example, "ğ,İ,ş,ç,ö,ü") in a Perl regular expression?

I tried to match these characters with the regular expression m{\w+}g, but it does not match "ğ,İ,ş,ç,ö,ü".

How can I make this work?

use strict;
use warnings;
use v5.12;
use utf8;

open(MYINPUTFILE, "< $ARGV[0]");

my @strings;
my $delimiter;
my $extensions;
my $id;

while(<MYINPUTFILE>)
{
    my($line) = $_;
    chomp($line);
    print $line."\n";
    unshift(@strings,$line =~ /\w+/g);
    $delimiter = /[._\s]/;
    $extensions = /pdf$|doc$|docx$/;
    $id = /^200|^201/;
}

foreach(@strings){
    print $_."\n";
}

The input file is like:

Çidem_Şener
Hüsnü Tağlip
...

The output goes like:

H�

sn�

Ta�

lip

�

idem_�

ener

In the code, I read the file and try to collect each word into the array. (The delimiter can be _ or . or whitespace.)

asked Mar 15 '12 17:03

2 Answers

Make sure that Perl is treating the data as UTF-8.

For example, if the data is embedded in the script itself:

#!/usr/bin/perl

use strict;
use warnings; 
use v5.12;
use utf8;   # States that the Perl program itself is saved using utf8 encoding

say "matched" if "ğİşçöü" =~ /^\w+$/;

That outputs matched. If I remove the use utf8; line, it does not.

answered Nov 15 '22 05:11

Quentin

\w matches any of ğ İ ş ç ö ü just fine.

'ğİşçöü' =~ /\A \w+ \z/msx;     # true

You probably forgot to decode the input from octets into Perl characters. I suspect your regex examines the data at the byte level instead of at the character level, which is what you would expect it to do.
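To make the octets-versus-characters difference concrete, here is a minimal sketch using the core Encode module. The sample word and the variable names are only illustrative; the octets are written out as explicit bytes so the result does not depend on how the script file is saved.

```perl
use strict;
use warnings;
use Encode qw(decode);

# UTF-8 octets for "Hüsnü", spelled out as bytes so the example does not
# depend on how this source file itself is saved.
my $octets = "H\xC3\xBCsn\xC3\xBC";

# Decode the octets into Perl characters before matching.
my $chars = decode('UTF-8', $octets);

my @octet_words = $octets =~ /\w+/g;   # the word is broken up at each "ü"
my @char_words  = $chars  =~ /\w+/g;   # the whole word matches at once

printf "octets: length %d, %d match(es)\n", length($octets), scalar @octet_words;
printf "chars:  length %d, %d match(es)\n", length($chars),  scalar @char_words;
```

The undecoded string has length 7 and yields two matches, while the decoded string has length 5 and yields one, which is the same kind of splitting the question's output shows.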

Read http://p3rl.org/UNI and http://training.perl.com/scripts/perlunicook.html to learn about the topic of encoding in Perl.

Edit:

The problem is likely here (I cannot tell for sure without the content of the file):

open(MYINPUTFILE, "< $ARGV[0]");

Find out the encoding of the file (perhaps it's UTF-8 or Windows-1254) and tell open about it with an I/O layer, e.g.:

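# Pick the one layer that matches the file's real encoding:
open my $in, '<:encoding(UTF-8)', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
# open my $in, '<:encoding(Windows-1254)', $ARGV[0] or die "Cannot open $ARGV[0]: $!";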

Printing characters out to STDOUT (near the end of your program) is broken in the same way, because no output encoding is set. ℞ 16: Declare STD{IN,OUT,ERR} to be in locale encoding shows one way to do it properly.
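Putting the input and output fixes together, the reading loop might look roughly like the sketch below. It assumes that both the input file and the terminal use UTF-8 (adjust the layers if yours differ), and read_words is a hypothetical helper name that is not part of the original script.

```perl
use strict;
use warnings;

# Assumption: the terminal expects UTF-8 output.
binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper: returns every \w+ run found in the file at $path.
# Assumption: the file is encoded as UTF-8.
sub read_words {
    my ($path) = @_;
    open my $in, '<:encoding(UTF-8)', $path
        or die "Cannot open $path: $!";
    my @words;
    while (my $line = <$in>) {
        chomp $line;
        push @words, $line =~ /\w+/g;   # lines are already decoded characters
    }
    close $in;
    return @words;
}

# Print one word per line when a file name is given on the command line.
if (@ARGV) {
    print "$_\n" for read_words($ARGV[0]);
}
```

Note that \w also matches the underscore, so Çidem_Şener stays a single word here. If _ should act as a delimiter, [^\W_]+ in place of \w+ would split it.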

answered Nov 15 '22 03:11

daxim

Tags: regex, unicode, perl

Asked by erogol
