Normalizing ASCII characters

Question

I need to normalize a string such as "quée", but I can't seem to convert the extended ASCII characters such as é, á, í, etc. into their plain Roman/English equivalents. I've tried several different methods, but nothing has worked so far. There is a fair amount of material on this general subject, but I can't find a working answer to this problem.

Here's my code:

#transliteration solution (works great with standard chars but doesn't find the 
#special ones) - I've tried looking for both \x{130} and é with the same result.
$mystring =~ tr/\x{130}/e/;

#converting into array, then iterating through and replacing the specific char
#( same result as the above solution )
my @breakdown = split( "",$mystring );

foreach ( @breakdown ) {
    if ( $_ eq "\x{130}" ) {
        $_ = "e";
        print "\nArray Output: @breakdown\n";
    }
    $lowercase = join( "",@breakdown );
}

DVK · Accepted Answer

1) This article should provide a fairly good (if complicated) way.

It provides a solution to converting all accented Unicode characters into the base character + accent; once that is done you can simply remove the accent characters separately.

2) Another option is CPAN: Text::Unaccent::PurePerl (An improved Pure Perl version of Text::Unaccent)

3) Also, this SO answer proposes Text::Unidecode:

$ perl -Mutf8 -MText::Unidecode -E 'say unidecode("été")'
  ete
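The first approach above (decompose each accented character into base character + accent, then remove the accents) can be sketched with the core Unicode::Normalize module. This is a minimal illustration rather than the linked article's exact code: the helper name strip_accents is hypothetical, and it assumes the input has already been decoded into Perl characters, so that é is the single code point U+00E9 (note that \x{130} in the question names U+0130, a different character).

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Hypothetical helper: return $text with combining accents removed.
sub strip_accents {
    my ($text) = @_;

    # NFD splits each precomposed character into its base letter
    # followed by one or more combining marks (é becomes e + U+0301).
    my $decomposed = NFD($text);

    # Delete the nonspacing marks, leaving only the base letters.
    $decomposed =~ s/\p{Mn}//g;

    return $decomposed;
}

# "qu\x{E9}e" is "quée" written with an escape, so the example does not
# depend on the encoding of the source file.
print strip_accents("qu\x{E9}e"), "\n";
```

Characters that have no canonical decomposition (for example ø or ß) pass through unchanged with this approach, which is one reason a module such as Text::Unidecode may still be the more convenient choice.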

Tags:

ascii

perl

normalization

Andrew Coomes

1 Answers

DVK
