Perl regex choking on multiple instances of character sets

Question

I started out with some crazy failures using preg_replace in PHP and boiled them down to one problem case: a pattern that contains more than one character class holding both the Turkish dotted "i" and undotted "ı". Here is a simple test case in PHP:

<?php
    echo 'match single normal i: ';
    $str = 'mi';
    echo (preg_match('!m[ıi]!', $str)) ? "ok\n" : "fail\n";

    echo 'match single undotted ı: ';
    $str = 'mı';
    echo (preg_match('!m[ıi]!', $str)) ? "ok\n" : "fail\n";

    echo 'match double normal i: ';
    $str = 'misir';
    echo (preg_match('!m[ıi]s[ıi]r!', $str)) ? "ok\n" : "fail\n";

    echo 'match double undotted ı: ';
    $str = 'mısır';
    echo (preg_match('!m[ıi]s[ıi]r!', $str)) ? "ok\n" : "fail\n";
?>

And the same test case again in Perl:

#!/usr/bin/perl

$str = 'mi';
$str =~ m/m[ıi]/ && print "match single normal i\n";

$str = 'mı';
$str =~ m/m[ıi]/ && print "match single undotted ı\n";

$str = 'misir';
$str =~ m/m[ıi]s[ıi]r/ && print "match double normal i\n";

$str = 'mısır';
$str =~ m/m[ıi]s[ıi]r/ && print "match double undotted ı\n";

The first three tests work fine. The last one does not match.

Why does this work fine with one character class but not when the same class appears a second time in the expression? How do I write an expression that matches a word like this no matter which combination of these letters it is written with?

Edit: Background on the language problem I'm trying to program for.

Edit 2: Adding a use utf8; directive does fix the Perl version. But my original problem was with a PHP program, and I only switched to Perl to see whether this was a bug in PHP, so that doesn't help me a whole lot. Does anybody know the directive to make PHP not choke on this?

Adrian Pronk · Accepted Answer

You may need to tell Perl that your source file contains UTF-8 characters. Try:

#!/usr/bin/perl

use utf8;   # **** Add this line

$str = 'mısır';
$str =~ m/m[ıi]s[ıi]r/ && print "match double undotted ı\n";

That doesn't help you with PHP, but there may be a similar directive there. Otherwise, try using some form of escape sequence to avoid putting the literal character in your source code. I know nothing about PHP, so I can't help with that.

Edit
I'm reading that PHP strings have no native Unicode support. The Unicode input you pass is therefore likely treated as the string of bytes it was encoded as.
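To make the "string of bytes" point concrete, here is a minimal sketch, not taken from the original answer. It writes ı (U+0131) as its two UTF-8 bytes so the result does not depend on how the file is saved, and the variable names are made up for illustration.

```php
<?php
// Dotless i (U+0131) spelled as its UTF-8 bytes.
$dotless = "\xc4\xb1";

$byteCount = strlen($dotless);   // strlen counts bytes, not characters
$hex       = bin2hex($dotless);  // the raw bytes in hex

// Without the "u" modifier a character class consumes exactly one byte,
// so [ıi] behaves like the three-member byte class [\xc4\xb1i].
$oneClass   = preg_match('!^[\xc4\xb1i]$!', $dotless);     // one class vs. two bytes
$twoClasses = preg_match('!^[\xc4\xb1i]{2}$!', $dotless);  // two classes vs. two bytes

echo "bytes: $byteCount, hex: $hex\n";
echo "one class matches whole string: $oneClass\n";
echo "two classes match whole string: $twoClasses\n";
```

A single class cannot cover the whole letter because the letter is two bytes long, which is the root of the failing fourth test.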

If you can be sure that your input arrives as UTF-8, then you can match the UTF-8 byte sequence for ı, which is \xc4 \xb1, as in:

$str = 'mısır';  # Make sure this source-file is encoded as utf-8 or this match will fail
echo (preg_match('!m(i|\xc4\xb1)s(i|\xc4\xb1)r!', $str)) ? "ok\n" : "fail\n";

Does that work?
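Another option, offered as a sketch rather than as part of the original answer: PHP's preg functions accept the PCRE u modifier, which makes both the pattern and the subject be treated as UTF-8. This is probably the closest PHP counterpart to Perl's use utf8 for this problem. The helper name matches_misir is hypothetical, and the sketch assumes the input is valid UTF-8 (preg_match returns false on malformed input when u is set).

```php
<?php
// Hypothetical helper; only preg_match and the "u" modifier are real API.
function matches_misir(string $word): bool
{
    // \x{131} is U+0131 (dotless ı). Under "u" the class [\x{131}i] holds
    // two characters, so each class consumes one whole letter.
    return preg_match('!^m[\x{131}i]s[\x{131}i]r$!u', $word) === 1;
}

// Subjects spell ı as its UTF-8 bytes so the file encoding is irrelevant.
$words = ['misir', "m\xc4\xb1s\xc4\xb1r", "m\xc4\xb1sir", "mis\xc4\xb1r"];
foreach ($words as $word) {
    echo $word, ': ', matches_misir($word) ? "ok\n" : "fail\n";
}
```

If the source file itself is saved as UTF-8, writing the class literally as [ıi] together with the u modifier should behave the same way.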

Edit again:
I can explain why your first three tests pass. Let's pretend that, in your encoding, ı is encoded as ABCDE. Then PHP sees the following:

echo 'match single normal i: ';
$str = 'mi';
echo (preg_match('!m[ABCDEi]!', $str)) ? "ok\n" : "fail\n";

echo 'match single undotted ABCDE: ';
$str = 'mABCDE';
echo (preg_match('!m[ABCDEi]!', $str)) ? "ok\n" : "fail\n";

echo 'match double normal i: ';
$str = 'misir';
echo (preg_match('!m[ABCDEi]s[ABCDEi]r!', $str)) ? "ok\n" : "fail\n";

echo 'match double undotted ABCDE: ';
$str = 'mABCDEsABCDEr';
echo (preg_match('!m[ABCDEi]s[ABCDEi]r!', $str)) ? "ok\n" : "fail\n";

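which makes it obvious why the first three tests pass and the last one fails. A character class matches exactly one unit, so in the second test the class matches A and the unanchored pattern ignores the rest, while in the fourth test the pattern needs s right after that A and finds B instead. If you use start/end anchors ^...$, you'll find that only the two plain-i tests (the first and third) still pass.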
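The anchored behaviour can be checked with a short sketch like the one below. It is an illustration under stated assumptions rather than the asker's code: ı is written as its real UTF-8 bytes \xc4\xb1 instead of the pretend ABCDE, no u modifier is used, and the $cases and $results names are made up.

```php
<?php
// ı as its UTF-8 bytes, so the file encoding does not matter here.
$dotless = "\xc4\xb1";

// Each case: [anchored byte-mode pattern, subject].
$cases = [
    'single normal i'   => ['!^m[\xc4\xb1i]$!', 'mi'],
    'single undotted ı' => ['!^m[\xc4\xb1i]$!', "m{$dotless}"],
    'double normal i'   => ['!^m[\xc4\xb1i]s[\xc4\xb1i]r$!', 'misir'],
    'double undotted ı' => ['!^m[\xc4\xb1i]s[\xc4\xb1i]r$!', "m{$dotless}s{$dotless}r"],
];

$results = [];
foreach ($cases as $label => $case) {
    list($pattern, $subject) = $case;
    // Each class consumes one byte, so a two-byte ı can never fit one class.
    $results[$label] = preg_match($pattern, $subject) === 1;
    echo 'match ', $label, ': ', $results[$label] ? "ok\n" : "fail\n";
}
```

With the anchors in place, the second test no longer passes by accident, which shows that it only ever matched the first byte of ı.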

Tags: regex, php, unicode, perl, turkish

Caleb
