
Comparing two Unicode strings with perl

Tags: unicode, perl

When I run the following code, it does not enter the "do something here" section:

my $a ='µ╫P[┐╬♣3▀═<+·1╪מ└╖"ª';
my $b ='µ╫P[┐╬♣3▀═<+·1╪מ└╖"ª';

if ($a ne $b) {
    # do something here    
}

Is there another way to compare Unicode strings in Perl?

smith asked Mar 05 '12 21:03



1 Answer

If you have two Unicode strings (i.e. strings of Unicode code points), then you presumably saved your file as UTF-8 and your code actually looked like this:

use utf8;  # Tell Perl source code is UTF-8.

my $a = 'µ╫P[┐╬♣3▀═<+·1╪מ└╖"ª';
my $b = 'µ╫P[┐╬♣3▀═<+·1╪מ└╖"ª';

if ($a eq $b) {
    print("They're equal.\n");
} else {
    print("They're not equal.\n");
}

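That works fine: eq and ne compare the strings code point by code point. In the code from the question the two strings are identical, so $a ne $b is false and the "do something here" block is correctly skipped.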
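The comparison above relies on both operands being strings of code points. As a minimal sketch, assuming the text arrives as raw UTF-8 bytes (for example from a file or socket, which the question does not actually state), the core Encode module can decode it before comparing. The variable names are only illustrative.

```perl
use strict;
use warnings;
use Encode qw( decode );

# Raw UTF-8 bytes for the micro sign, as they might arrive from outside the program.
my $bytes = "\xC2\xB5";

# Decode the bytes into a string of Unicode code points.
my $chars = decode('UTF-8', $bytes);

# The same character written directly as a code point (U+00B5).
my $expected = "\x{B5}";

print "bytes length: ", length($bytes), "\n";   # 2 (two bytes)
print "chars length: ", length($chars), "\n";   # 1 (one code point)

my $decoded_matches = ($chars eq $expected) ? "equal" : "not equal";
my $raw_matches     = ($bytes eq $expected) ? "equal" : "not equal";

print "decoded vs expected: $decoded_matches\n";   # equal
print "raw bytes vs expected: $raw_matches\n";     # not equal
```

The undecoded string fails the comparison because eq sees two code points (0xC2, 0xB5) on one side and a single one (0xB5) on the other.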

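Certain graphemes (e.g. "é") can be encoded in more than one way (as a single precomposed code point, or as a base letter followed by a combining mark), so you may need to normalize both strings to the same form before comparing them.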

use utf8;  # Tell Perl source code is UTF-8.

use charnames          qw( :full );  # For \N{}
use Unicode::Normalize qw( NFC );

my $a = NFC("\N{LATIN SMALL LETTER E WITH ACUTE}");
my $b = NFC("e\N{COMBINING ACUTE ACCENT}");

if ($a eq $b) {
    print("They're equal.\n");
} else {
    print("They're not equal.\n");
}
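To see why the normalization step matters, the following sketch compares the same two spellings of "é" with and without NFC. The helper name nfc_eq is hypothetical, introduced here only for illustration; NFC itself comes from Unicode::Normalize as in the code above.

```perl
use strict;
use warnings;
use Unicode::Normalize qw( NFC );

# Hypothetical helper: true if both strings are equal after NFC normalization.
sub nfc_eq {
    my ($x, $y) = @_;
    return NFC($x) eq NFC($y);
}

my $precomposed = "\x{E9}";     # LATIN SMALL LETTER E WITH ACUTE, one code point
my $decomposed  = "e\x{301}";   # "e" followed by COMBINING ACUTE ACCENT, two code points

my $raw        = ($precomposed eq $decomposed)       ? "equal" : "not equal";
my $normalized = nfc_eq($precomposed, $decomposed)   ? "equal" : "not equal";

print "without normalization: $raw\n";        # not equal
print "with NFC:              $normalized\n"; # equal
```

Normalizing both sides inside the helper keeps the call sites simple, at the cost of repeating the normalization on every comparison; normalizing once on input is an alternative if the same strings are compared many times.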

Finally, Unicode treats certain characters as compatibility-equivalent (nearly, but not exactly, the same character), and a different normalization form, NFKC, makes them compare equal.

use utf8;  # Tell Perl source code is UTF-8.

use charnames          qw( :full );  # For \N{}
use Unicode::Normalize qw( NFKC );

my $a = NFKC("2");
my $b = NFKC("\N{SUPERSCRIPT TWO}");

if ($a eq $b) {
    print("They're equal.\n");
} else {
    print("They're not equal.\n");
}
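As a rough illustration of how the two normalization forms differ, this sketch runs the same pair of strings through both NFC and NFKC. It assumes only Unicode::Normalize, as above; the variable names are illustrative.

```perl
use strict;
use warnings;
use Unicode::Normalize qw( NFC NFKC );

my $digit = "2";
my $super = "\x{B2}";   # SUPERSCRIPT TWO

# NFC applies canonical equivalence only, so the superscript stays distinct.
my $canonical = (NFC($super) eq NFC($digit)) ? "equal" : "not equal";

# NFKC also applies compatibility mappings, which fold the superscript to "2".
my $compatibility = (NFKC($super) eq NFKC($digit)) ? "equal" : "not equal";

print "NFC:  $canonical\n";       # not equal
print "NFKC: $compatibility\n";   # equal
```

Because NFKC discards distinctions such as superscripts, it is better suited to matching and searching than to text that will be stored or displayed.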
ikegami answered Nov 22 '22 21:11