I've got these two strings:
%EC%E0%EC%E0+%EC%FB%EB%E0+%F0%E0%EC%F3
%D0%BC%D0%B0%D0%BC%D0%B0%20%D0%BC%D1%8B%D0%BB%D0%B0%20%D1%80%D0%B0%D0%BC%D1%83
These are the same Russian phrase, URL-encoded in cp1251 and UTF-8 respectively. I want to display them in Russian in my UTF-8 terminal using Perl. Unfortunately, the Perl module Encode::Detect (after URL decoding) fails to detect cp1251 for the first example; instead, it proposes "x-euc-tw".
The question is: what is the proper way to detect the right encoding in this case (specifying locale parameters, using other modules, ...)?
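Here is a minimal sketch of the reproduction (URI::Escape for the URL decoding is my choice; Encode::Detect::Detector is the low-level interface that returns the guessed charset name):

use strict;
use warnings;
use URI::Escape qw( uri_unescape );
use Encode::Detect::Detector;

# '+' is the form-encoding for a space; translate it before unescaping.
(my $raw = '%EC%E0%EC%E0+%EC%FB%EB%E0+%F0%E0%EC%F3') =~ tr/+/ /;
my $bytes = uri_unescape($raw);

# Prints "x-euc-tw" rather than the expected windows-1251.
print Encode::Detect::Detector::detect($bytes), "\n";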
Are UTF-8 and cp1251 the only two options? The odds of cp1251 text also being valid UTF-8 are vanishingly small (it would be gibberish), so you can do
use Encode qw( decode );

# Try strict UTF-8 first; fall back to cp1251 if the bytes aren't valid UTF-8.
my $decoded = eval { decode('UTF-8', $encoded, Encode::FB_CROAK) }
           // decode('cp1251', $encoded);
This will be far more accurate than an encoding guesser.
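For instance, a sketch applying this to your two strings (assuming URI::Escape for the unescaping; note that decode() with a CHECK value may modify its argument, so a copy is decoded):

use strict;
use warnings;
use URI::Escape qw( uri_unescape );
use Encode qw( decode );

binmode STDOUT, ':encoding(UTF-8)';

for my $url_enc (
    '%EC%E0%EC%E0+%EC%FB%EB%E0+%F0%E0%EC%F3',
    '%D0%BC%D0%B0%D0%BC%D0%B0%20%D0%BC%D1%8B%D0%BB%D0%B0%20%D1%80%D0%B0%D0%BC%D1%83',
) {
    (my $raw = $url_enc) =~ tr/+/ /;    # '+' encodes a space in form data
    my $bytes = uri_unescape($raw);

    # Decode a copy, since decode() with a CHECK value can modify its argument.
    my $text = eval { decode('UTF-8', my $copy = $bytes, Encode::FB_CROAK) }
            // decode('cp1251', $bytes);

    print "$text\n";                    # both print the phrase in Cyrillic
}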
Encode::Detect, which uses the Mozilla universal character set detector, works by letting different character set probers look at the data. The probers report different confidence levels, and the prober with the highest confidence wins. This process depends on the input only; it is not affected by locale or other external settings. In this case, for whatever reason, the prober for euc-tw reports a higher confidence than the prober for windows-1251, and there is nothing you can do short of changing the data or modifying the source code.
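For completeness, Encode::Detect registers a "Detect" pseudo-encoding, so the detector's pick feeds straight into decode() — a sketch, reusing the URL-decoded $bytes from above:

use Encode qw( decode );
use Encode::Detect;   # registers the "Detect" pseudo-encoding

# Decodes using whatever charset the Mozilla detector picks;
# for the cp1251 bytes above, that pick is wrong (x-euc-tw).
my $text = decode('Detect', $bytes);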
You could try Encode::Guess, which allows specifying a list of encodings to choose from.
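A sketch with cp1251 added to the default suspect list:

use Encode qw( decode );
use Encode::Guess qw( cp1251 );   # add cp1251 to the default suspects

# Croaks if the data is valid in more than one suspect encoding.
my $text = eval { decode('Guess', $bytes) };

Be aware that Encode::Guess refuses to pick when the data is valid in more than one suspect, and since most UTF-8 byte strings are also valid cp1251, it can easily report an ambiguity here rather than an answer.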