I need to normalize a string such as "quée", but I can't convert the extended ASCII (accented) characters such as é, á, and í into their plain Roman/English equivalents. I've tried several different methods, but nothing has worked so far. There is a fair amount of material on this general subject, but I can't find a working answer to this problem.
Here's my code:
# transliteration solution (works great with standard chars but doesn't find the
# special ones) - I've tried looking for both \x{130} and é with the same result.
$mystring =~ tr/\\x{130}/e/;

# converting into array, then iterating through and replacing the specific char
# ( same result as the above solution )
my @breakdown = split( "", $mystring );
foreach ( @breakdown ) {
    if ( $_ eq "\x{130}" ) {
        $_ = "e";
        print "\nArray Output: @breakdown\n";
    }
    $lowercase = join( "", @breakdown );
}
1) This article should provide a fairly good (if complicated) way.
It shows how to convert every accented Unicode character into its base character plus a separate accent character; once that is done, you can simply remove the accent characters in a second step.
2) Another option is CPAN: Text::Unaccent::PurePerl (an improved pure-Perl version of Text::Unaccent).
3) Also, this SO answer proposes Text::Unidecode:
    $ perl -Mutf8 -MText::Unidecode -E 'say unidecode("été")'
    ete
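For reference, the decompose-then-strip idea from option 1 can be sketched with the core Unicode::Normalize module. This is only a minimal sketch, not the code from the linked article: the helper name strip_accents is hypothetical, and it assumes the input is already a decoded Perl character string (here a UTF-8 source literal under "use utf8").

```perl
use strict;
use warnings;
use utf8;                              # this source file contains literal UTF-8
use Unicode::Normalize qw(NFD);

# strip_accents is a hypothetical helper name, not part of any module.
sub strip_accents {
    my ($text) = @_;
    my $decomposed = NFD($text);       # canonical decomposition: é becomes e + U+0301
    $decomposed =~ s/\p{Mn}//g;        # remove nonspacing (combining) marks
    return $decomposed;
}

binmode STDOUT, ':encoding(UTF-8)';
print strip_accents("quée"), "\n";     # prints "quee"
```

Note that this only handles characters that have a canonical decomposition. Letters such as ø or ß do not decompose into a base letter plus a mark, so they pass through unchanged; for those, a transliteration module such as Text::Unidecode may be a better fit.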