How to match string with diacritic in perl?

For example, match "Nation" in "Îñţérñåţîöñåļîžåţîöñ" without extra modules. Is this possible in newer Perl versions (5.14, 5.15, etc.)?

I found an answer, thanks to tchrist!

Right solution with UCA match (thanks to https://stackoverflow.com/users/471272/tchrist):

    # find start/end offsets of the matched UTF-8 substring (without overlaps)
    use 5.014;
    use strict;
    use warnings;
    use utf8;
    use Unicode::Collate;

    binmode STDOUT, ':encoding(UTF-8)';

    my $str  = "Îñţérñåţîöñåļîžåţîöñ" x 2;
    my $look = "Nation";

    my $Collator = Unicode::Collate->new(
        normalization => undef,
        level         => 1,
    );

    my @match = $Collator->match($str, $look);
    if (@match) {
        my $found = $match[0];
        my $f_len = length($found);
        say "match result: $found (length is $f_len)";

        my $offset = 0;
        while ((my $start = index($str, $found, $offset)) != -1) {
            my $end = $start + $f_len;
            say sprintf("found at: %s,%s", $start, $end);
            $offset = $end;   # resume right after the match so adjacent matches are not skipped
        }
    }

Wrong (but working) solution from http://www.perlmonks.org/?node_id=485681

The magic piece of code is:

    $str = Unicode::Normalize::NFD($str);
    $str =~ s/\pM//g;

Code example:

    use 5.014;
    use utf8;
    use Unicode::Normalize;

    binmode STDOUT, ':encoding(UTF-8)';

    my $str  = "Îñţérñåţîöñåļîžåţîöñ";
    my $look = "Nation";
    say "before: $str\n";
    $str = NFD($str);
    # M is short alias for \p{Mark} (http://perldoc.perl.org/perluniprops.html)
    $str =~ s/\pM//og; # remove "marks"
    say "after: $str";
    say "is_match: ", $str =~ /$look/i || 0;
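To illustrate why the mark-stripping approach only works for some input, here is a minimal sketch. The helper name `strip_marks` and the second sample string are my own assumptions, not part of the original code. Letters such as Ø and ß have no canonical decomposition, so NFD leaves them as they are and there is no mark to remove.

```perl
use strict;
use warnings;
use utf8;
use Unicode::Normalize qw(NFD);

# Hypothetical helper: decompose the text, then drop every combining mark.
sub strip_marks {
    my ($text) = @_;
    my $decomposed = NFD($text);
    $decomposed =~ s/\p{Mark}//g;
    return $decomposed;
}

binmode STDOUT, ':encoding(UTF-8)';

# Every accented letter here decomposes into a base letter plus a mark.
print strip_marks("Îñţérñåţîöñåļîžåţîöñ"), "\n";

# Ø and ß do not decompose, so they survive the stripping unchanged.
print strip_marks("Øresund Straße"), "\n";
```

So a later ASCII-only regex would still fail on the second string, which is one reason this solution is labelled "wrong" above.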


asked Sep 15 '11 11:09

nordicdyno

2 Answers

Right solution with UCA (thanks to tchrist):

    # check whether $look matches inside $str at the primary collation level
    use 5.014;
    use utf8;
    use Unicode::Collate;

    binmode STDOUT, ':encoding(UTF-8)';

    my $str  = "Îñţérñåţîöñåļîžåţîöñ" x 2;
    my $look = "Nation";

    my $Collator = Unicode::Collate->new(
        normalization => undef,
        level         => 1,
    );

    my @match = $Collator->match($str, $look);
    say "match ok!" if @match;
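If you also need the positions, the collator's own `index` method can do the searching, so every primary-level match is found and not only exact copies of the first matched substring. The following is a rough sketch: the helper name `uca_positions` is made up for illustration, and it relies on `index` returning a (position, length) pair in list context and an empty list when nothing matches, as I understand the Unicode::Collate documentation.

```perl
use strict;
use warnings;
use utf8;
use Unicode::Collate;

# Hypothetical helper: list [start, length] for each non-overlapping
# primary-level match of $look inside $str.
sub uca_positions {
    my ($str, $look) = @_;

    # index() refuses to run unless normalization is switched off.
    my $collator = Unicode::Collate->new(normalization => undef, level => 1);

    my @hits;
    my $pos = 0;
    while ($pos <= length $str) {
        # List context: (position, length) on success, empty list otherwise.
        my ($start, $len) = $collator->index($str, $look, $pos);
        last unless defined($start) && $len;   # no match, or a zero-length one
        push @hits, [$start, $len];
        $pos = $start + $len;                  # continue right after this match
    }
    return @hits;
}

binmode STDOUT, ':encoding(UTF-8)';

my $text = "Îñţérñåţîöñåļîžåţîöñ" x 2;
for my $hit (uca_positions($text, "Nation")) {
    my ($start, $len) = @$hit;
    printf "found at %d..%d: %s\n", $start, $start + $len, substr($text, $start, $len);
}
```

The offsets are character offsets into the original string, which is why normalization has to stay off: a normalized copy could have different positions and lengths.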

P.S. "Code that assumes you can remove diacritics to get at base ASCII letters is evil, still, broken, brain-damaged, wrong, and justification for capital punishment." © tchrist, from "Why does modern Perl avoid UTF-8 by default?"


answered Oct 06 '22 22:10

nordicdyno

What do you mean by "without extra modules"?

Here is a solution using Unicode::Normalize (see perldoc).

I removed the "ţ" and the "ļ" from your string because my Eclipse didn't want to save the script with them.

    use strict;
    use warnings;
    use utf8;   # the pragma name is lowercase
    use Unicode::Normalize;

    my $str = "Îñtérñåtîöñålîžåtîöñ";

    for ( $str ) {  # the variable we work on

        ##  convert to Unicode first
        ##  if your data comes in Latin-1, then uncomment:
        #$_ = Encode::decode( 'iso-8859-1', $_ );

        $_ = NFD( $_ );   ##  decompose
        s/\pM//g;         ##  strip combining characters
        s/[^\0-\x80]//g;  ##  clear everything else
    }

    if ($str =~ /nation/) {
        print $str . "\n";
    }

The output is

Internationaliation

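In my run, the "ž" was removed from the string; it did not seem to be handled as a composed character. (U+017E does have a canonical decomposition, so this may be an encoding issue on my side.)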
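One way to check whether a character is a composed one is to look at its code points after NFD. This is only a small sketch, and the helper name `nfd_codepoints` is my own invention. The character is written as an escape so the result does not depend on how the editor saved the file.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Hypothetical helper: the code points of a string after canonical decomposition.
sub nfd_codepoints {
    my ($text) = @_;
    return map { sprintf('U+%04X', ord($_)) } split //, NFD($text);
}

# U+017E is "ž"; a composed character yields a base letter plus a combining mark.
print join(' ', nfd_codepoints("\x{17E}")), "\n";   # prints "U+007A U+030C"
```

If a character comes back as a single unchanged code point, NFD had nothing to split, and `s/\pM//g` will not touch it.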

The code for the for loop is from this site: How to remove diacritic marks from characters

Another interesting read is The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky.

Update:

As @tchrist pointed out, there is an existing algorithm that is better suited to this: the UCA (Unicode Collation Algorithm). @nordicdyno already provided an implementation in his question.

The algorithm is described in Unicode Technical Standard #10, Unicode Collation Algorithm.

The Perl module is described on perldoc.perl.org.
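To make the role of the collation level more concrete, here is a short sketch that compares strings with the collator's `eq` method at different levels. The helper name `uca_equal` is an assumption for illustration only. Roughly speaking, level 1 looks at base letters, level 2 adds accents, and level 3 adds case.

```perl
use strict;
use warnings;
use utf8;
use Unicode::Collate;

# Hypothetical helper: compare two strings at the given UCA level (1 to 4).
sub uca_equal {
    my ($left, $right, $level) = @_;
    my $collator = Unicode::Collate->new(normalization => undef, level => $level);
    return $collator->eq($left, $right) ? 1 : 0;
}

my ($plain, $accented) = ("Nation", "ñåţîöñ");

print "level 1: ", uca_equal($plain, $accented, 1), "\n";     # base letters only
print "level 2: ", uca_equal($plain, $accented, 2), "\n";     # accents count as well
print "level 3: ", uca_equal("Nation", "nation", 3), "\n";    # case counts as well
```

This is why the code in the question uses `level => 1`: accents and case are both ignored at that level, without touching the string itself.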

answered Oct 07 '22 00:10

stema
