How to encode accented and other foreign characters to UTF8 in perl

I've tried everywhere to learn and get my head around this but I haven't gotten anywhere.

Let's take a string:

Macaroon dessert muffin. Sugar plum cookie macaroon soufflé lollipop candy brownie tiramisu croissant. Wafer ice cream chocolate bar gummies. Cheesecake powder chupa chups. Donut pastry candy canes. Liquorice tootsie roll candy canes jelly-o. Sesame snaps applicake sugar plum cupcake apple pie. Chocolate ice cream cotton candy soufflé. Apple pie danish unerdwear.com wafer unerdwear.com muffin applicake pudding. Jelly cotton candy brownie lollipop macaroon sweet roll carrot cake chocolate bar. Tart lollipop cookie unerdwear.com gummies powder. Jelly halvah apple pie pudding caramels marzipan. Marzipan jelly-o topping pie powder icing. Gummies jelly-o tiramisu bear claw brownie cheesecake. Icing pie oat cake lollipop carrot cake toffee. Donut jelly sugar plum muffin. Fruitcake tiramisu jujubes muffin tart jelly-o pie fruitcake. Unerdwear.com jujubes unerdwear.com gummi bears jelly beans brownie macaroon. Marzipan halvah cake tootsie roll cotton candy cotton candy donut. Soufflé wafer candy canes carrot cake. Cheesecake muffin powder gummies carrot cake. Halvah ice cream applicake liquorice macaroon apple pie cupcake. Cake dragée liquorice. Sugar plum biscuit halvah. Carrot cake candy canes sweet candy. Candy canes marzipan marshmallow danish cake jelly-o brownie cookie oat cake.

When I do:

Encode::encode('UTF-8', $text);

on that string, the word Soufflé gets encoded to SoufflÃ©.

When I look at this, I don't recognise it as any code point or as any valid encoding mapping (i.e. é). How am I to expect it to reach its destination in a way that it can be read correctly? In other words, why does Perl give me Ã© when I've encoded it as UTF8 and it should have given me é?

xmlbody($text);

sub xmlbody {
    my $description = shift;

    use XML::Writer;
    my $writer = XML::Writer->new( OUTPUT => 'self', ENCODING => 'utf-8' );
    $writer->xmlDecl('utf-8');

    ## ...structure

    $writer->cdataElement('description',$description);

    ## ...more structure

    $writer->end();
}

use utf8; doesn't seem to be encoding the special characters in the above-mentioned string; it still gives "Ã©". Would having $writer->xmlDecl('utf-8') be the equivalent of use open qw(:std :utf8), since I'm not using a filehandle or stdin/stdout?

a7omiton asked Oct 31 '22 20:10

1 Answer

When I look at this, I don't recognise it as any code point or as any valid encoding mapping

If you look at the relevant page on fileformat.info, you'll see what is happening.

Initially, in your program you have a Unicode character "é". The Unicode code point for that is U+00E9. When you encode that character as UTF-8, you get a sequence of two bytes: 0xC3 0xA9. If you look at the code page for ISO-8859-1, you'll see that 0xC3 is "Ã" and 0xA9 is "©".

If you try to display that two-byte sequence on a device that understands UTF-8 and is expecting UTF-8 then you'll get "é". Otherwise the device will interpret each byte using its native character encoding (which is likely to be ISO-8859-1) and you'll get the mojibake that you've seen, "Ã©".
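
The byte-level behaviour described above can be checked with a small sketch. It assumes only the core Encode module; the variable names are illustrative, and the character is written as an escape so the result does not depend on how the script file itself is saved.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# One character, U+00E9 (é), written as an escape.
my $char = "\x{E9}";

# Encoding as UTF-8 yields a two-byte string.
my $bytes = encode('UTF-8', $char);
my $hex   = join ' ', map { sprintf '0x%02X', ord } split //, $bytes;

# Misreading those two bytes as ISO-8859-1 yields two characters.
my $misread        = decode('ISO-8859-1', $bytes);
my @misread_points = map { sprintf 'U+%04X', ord } split //, $misread;

# Reading the same bytes as UTF-8 recovers the single original character.
my $roundtrip = decode('UTF-8', $bytes);

print "bytes: $hex\n";                               # bytes: 0xC3 0xA9
print "misread as ISO-8859-1: @misread_points\n";    # U+00C3 U+00A9
print "round trip ok\n" if $roundtrip eq $char;
```

U+00C3 and U+00A9 are "Ã" and "©", so the encoded bytes are correct; what goes wrong is how the receiving side interprets them.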

As tchrist says, the easiest way to handle this is to use Perl's tools that take care of it without you having to think about it.
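
The comment referred to is not reproduced here, so the following is only one reading of "Perl's tools": a PerlIO encoding layer on the output handle, so that the program works with characters and Perl encodes them when writing. This sketch writes to an in-memory buffer to keep it self-contained; for real output the same layer can be applied with binmode(STDOUT, ':encoding(UTF-8)').

```perl
use strict;
use warnings;

# A character string: 7 characters, the last one being U+00E9 (é).
my $text = "Souffl\x{E9}";

# Open an in-memory filehandle with a UTF-8 encoding layer.
# Perl encodes the characters as they are written; no explicit encode() call.
my $buffer = '';
open my $out, '>:encoding(UTF-8)', \$buffer or die "open failed: $!";
print {$out} $text;
close $out or die "close failed: $!";

# $buffer now holds bytes; é accounts for two of them.
printf "%d characters, %d bytes\n", length($text), length($buffer);
```

With a layer in place, calling Encode::encode on the string before printing would encode it a second time, which is another common source of "Ã©"-style output.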

Dave Cross answered Nov 15 '22 07:11