I am looking for advice on which library and/or function I should use to convert international text to its plain English (ASCII) character equivalent.
For example
Vous avez aimé l'épée offerte par les elfes à Frodon
convert into
Vous avez aime l'epee offerte par les elfes a Frodon
In Python, we can remove accents from a string with a third-party module called Unidecode. The module provides a function, unidecode(), that takes a Unicode string and returns an ASCII string without accents.
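If installing Unidecode is not an option, a rough stand-in for accented Latin letters can be sketched with the standard library's unicodedata module. Note that this is a different technique from Unidecode (it decomposes each character and drops the combining marks rather than transliterating), and the helper name strip_accents is just a made-up name for illustration:

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Hypothetical helper: drop combining accent marks from text."""
    # NFKD splits a character such as "é" into "e" plus a combining accent.
    decomposed = unicodedata.normalize("NFKD", text)
    # Category "Mn" (nonspacing mark) covers the combining accents.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


print(strip_accents("Vous avez aimé l'épée offerte par les elfes à Frodon"))
# prints: Vous avez aime l'epee offerte par les elfes a Frodon
```

Characters that have no decomposition, such as "ø", pass through unchanged with this approach, so it is not a full replacement for a transliteration library.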
In Java, use java.text.Normalizer to handle this for you. Normalizing to a decomposed form such as NFD separates the accent marks from their base characters, after which the marks can be stripped out.
In Perl, first decompose the characters using Unicode::Normalize, then use a simple regex to delete all the diacritics. (I think simply grabbing all the non-spacing mark characters should do it, but there might be an obscure exception or two.)
Here's an example:
use strict;
use warnings;
use utf8;
use Unicode::Normalize;
my $test = "Vous avez aimé l'épée offerte par les elfes à Frodon";
my $decomposed = NFKD( $test );
$decomposed =~ s/\p{NonspacingMark}//g;
print $decomposed;
Output:
Vous avez aime l'epee offerte par les elfes a Frodon