Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How can I convert non-ASCII characters encoded in UTF8 to ASCII-equivalent in Perl?

I have a Perl script that is being called by third parties to send me names of people who have registered my software. One of these parties encodes the names in UTF-8, so I have adapted my script accordingly to decode UTF-8 to ASCII with Encode::decode_utf8(...).

This usually works fine, but every 6 months or so one of the names contains cyrillic, greek or romanian characters, so decoding the name results in garbage characters such as "ПодражанÑкаÑ". I have to follow-up with the customer and ask him for a "latin character version" of his name in order to issue a registration code.

So, is there any Perl module that can detect whether there are such characters and automatically translates them to their closest ASCII representation if necessary?

It seems that I can use Lingua::Cyrillic::Translit::ICAO plus Lingua::DetectCharset to handle Cyrillic, but I would prefer something that works with other character sets as well.

like image 701
Adrian Grigore Avatar asked Mar 12 '09 10:03

Adrian Grigore


People also ask

How do I get the ASCII value of a character in Perl?

The ord() function is an inbuilt function in Perl that returns the ASCII value of the first character of a string. This function takes a character string as a parameter and returns the ASCII value of the first character of this string.

Can UTF-8 be read as ASCII?

Each character is represented by one to four bytes. UTF-8 is backward-compatible with ASCII and can represent any standard Unicode character. The first 128 UTF-8 characters precisely match the first 128 ASCII characters (numbered 0-127), meaning that existing ASCII text is already valid UTF-8.

Is UTF-8 the same as ASCII?

For characters represented by the 7-bit ASCII character codes, the UTF-8 representation is exactly equivalent to ASCII, allowing transparent round trip migration. Other Unicode characters are represented in UTF-8 by sequences of up to 6 bytes, though most Western European characters require only 2 bytes3.

Does Perl support Unicode?

While Perl does not implement the Unicode standard or the accompanying technical reports from cover to cover, Perl does support many Unicode features. Also, the use of Unicode may present security issues that aren't obvious, see "Security Implications of Unicode" below.


1 Answers

I believe you could use Text::Unidecode for this, it is precisely what it tries to do.

like image 164
mirod Avatar answered Nov 08 '22 18:11

mirod