
The proper way of encoding detection in perl

I've got these two strings:

%EC%E0%EC%E0+%EC%FB%EB%E0+%F0%E0%EC%F3
%D0%BC%D0%B0%D0%BC%D0%B0%20%D0%BC%D1%8B%D0%BB%D0%B0%20%D1%80%D0%B0%D0%BC%D1%83

They are the same URL-encoded Russian phrase, in cp1251 and UTF-8 respectively. I want to see them in Russian in my UTF-8 terminal using Perl. Unfortunately, after URL-decoding, the Perl module Encode::Detect fails to detect cp1251 for the first example. Instead, it reports "x-euc-tw".

The question is, what is the proper way of detecting the right encoding in this case (specifying locale parameters, using other modules...)?

Igor Shalyminov asked Feb 19 '23 17:02



2 Answers

Are UTF-8 and cp1251 the only two options? The odds of cp1251 text also being valid UTF-8 are extremely small. (It would be gibberish.) So you can try a strict UTF-8 decode first and fall back to cp1251 if it fails:

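use Encode qw( decode );
# Try strict UTF-8 first; if the bytes are not valid UTF-8, eval yields undef
# and we fall back to cp1251. LEAVE_SRC keeps decode() from modifying $encoded.
my $decoded = eval { decode('UTF-8', $encoded, Encode::FB_CROAK | Encode::LEAVE_SRC) }
    // decode('cp1251', $encoded);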

This will be far more accurate than an encoding guesser.
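For completeness, here is a minimal end-to-end sketch that applies this approach to the two strings from the question, assuming UTF-8 and cp1251 really are the only candidates. The helper names `url_decode` and `decode_utf8_or_cp1251` are made up for illustration and are not part of any module.

```perl
use strict;
use warnings;
use Encode qw( decode );

# Hypothetical helper: turn a URL-encoded string into raw bytes.
sub url_decode {
    my ($s) = @_;
    $s =~ tr/+/ /;                              # '+' stands for a space
    $s =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;  # %XX -> single byte
    return $s;
}

# Hypothetical helper: strict UTF-8 first, cp1251 as the fallback.
sub decode_utf8_or_cp1251 {
    my ($bytes) = @_;
    my $text = eval {
        decode('UTF-8', $bytes, Encode::FB_CROAK | Encode::LEAVE_SRC);
    };
    return $text // decode('cp1251', $bytes);
}

my @inputs = (
    '%EC%E0%EC%E0+%EC%FB%EB%E0+%F0%E0%EC%F3',
    '%D0%BC%D0%B0%D0%BC%D0%B0%20%D0%BC%D1%8B%D0%BB%D0%B0%20%D1%80%D0%B0%D0%BC%D1%83',
);

# Encode the decoded text as UTF-8 on output, for a UTF-8 terminal.
binmode STDOUT, ':encoding(UTF-8)';
for my $input (@inputs) {
    print decode_utf8_or_cp1251(url_decode($input)), "\n";
}
```

Both inputs should print the same Russian phrase. The `//` operator needs Perl 5.10 or later.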

ikegami answered Feb 27 '23 16:02



Encode::Detect, which uses the Mozilla universal character set detector, works by letting different character set probers look at the data. The probers then report different confidence levels and the prober with highest confidence wins. This process depends on the input only; it is not affected by locale or other external settings. In this case, for whatever reason, the prober for euc-tw is reporting a higher confidence than the prober for windows-1251, and there's nothing you can do short of changing the data or modifying the source code.

You could try Encode::Guess, which lets you specify a list of candidate encodings to choose from.
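A minimal sketch of that idea, assuming the input is the first string from the question after URL-decoding, with cp1251 added as the only extra suspect:

```perl
use strict;
use warnings;
use Encode::Guess;   # exports guess_encoding by default

# The first string from the question as raw cp1251 bytes.
my $bytes = "\xEC\xE0\xEC\xE0 \xEC\xFB\xEB\xE0 \xF0\xE0\xEC\xF3";

# cp1251 is tried in addition to the default suspects (ascii and utf8).
my $guess = guess_encoding($bytes, 'cp1251');

# On success an encoding object is returned; on failure, an error string.
if (ref $guess) {
    binmode STDOUT, ':encoding(UTF-8)';
    print $guess->name, ': ', $guess->decode($bytes), "\n";
}
else {
    warn "Could not guess the encoding: $guess\n";
}
```

Note that Encode::Guess reports an error when more than one suspect can decode the data. A single-byte encoding such as cp1251 accepts almost any byte sequence, so UTF-8 input may come back as ambiguous with this suspect list.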

Joni answered Feb 27 '23 16:02
