
Normalizing ASCII characters

I need to normalize a string such as "quée", but I can't seem to convert accented (extended ASCII) characters such as é, á, í, etc. into their plain Roman/English equivalents. I've tried several different methods, but nothing has worked so far. There is a fair amount of material on this general subject, but I can't find a working answer to this specific problem.

Here's my code:

#transliteration solution (works great with standard chars but doesn't find the 
#special ones) - I've tried looking for both \x{130} and é with the same result.
$mystring =~ tr/\\x{130}/e/;

#converting into array, then iterating through and replacing the specific char
#( same result as the above solution )
my @breakdown = split( "",$mystring );

foreach ( @breakdown ) {
    if ( $_ eq "\x{130}" ) {
        $_ = "e";
        print "\nArray Output: @breakdown\n";
    }
    $lowercase = join( "",@breakdown );
}
Andrew Coomes asked May 24 '12 17:05



1 Answer

1) This article should provide a fairly good (if complicated) way.

It shows how to decompose each accented Unicode character into its base character plus a separate combining accent; once that is done, you can simply strip the accent characters.


2) Another option is CPAN: Text::Unaccent::PurePerl (An improved Pure Perl version of Text::Unaccent)


3) Also, this SO answer proposes Text::Unidecode:

$ perl -Mutf8 -MText::Unidecode -E 'say unidecode("été")'
  ete
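
For reference, option 1 (decompose, then strip the accents) can be sketched with core modules only. This is a minimal illustration, not the code from the linked article: the helper name strip_accents is made up here, and it assumes the input is already a decoded character string rather than raw bytes. Note also that é is U+00E9, so a pattern looking for \x{130} would not match it.

```perl
use strict;
use warnings;
use utf8;                          # this source file contains UTF-8 literals
use Unicode::Normalize qw(NFD);    # core module since Perl 5.8

# Hypothetical helper (name chosen for illustration only).
sub strip_accents {
    my ($text) = @_;
    my $decomposed = NFD($text);   # "é" becomes "e" followed by U+0301 (combining acute)
    $decomposed =~ s/\p{Mn}+//g;   # remove nonspacing combining marks
    return $decomposed;
}

binmode STDOUT, ':encoding(UTF-8)';
print strip_accents("quée"), "\n";    # prints "quee"
```

Letters that have no canonical decomposition, such as ø, pass through this sketch unchanged; for those, a transliteration module like Text::Unidecode is probably the better fit.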
DVK answered Nov 15 '22 07:11
