
Fixing a file consisting of both UTF-8 and Windows-1252

I have an application that produces a UTF-8 file, but some of the contents are incorrectly encoded. Some of the characters are encoded as iso-8859-1 aka iso-latin-1 or cp1252 aka Windows-1252. Is there a way of recovering the original text?

asked Feb 23 '15 at 19:02 by ikegami


People also ask

How do I change the encoding from Windows-1252 to UTF-8?

Just open up the windows-1252 encoded file in Notepad, then choose 'Save as' and set encoding to UTF-8.

What is the difference between Windows-1252 and UTF-8?

Windows-1252's character repertoire is a subset of Unicode, which UTF-8 encodes, but the byte-by-byte representations differ. Windows-1252 assigns characters to bytes 128 through 255, and UTF-8 encodes each of those characters as a multi-byte sequence instead. Only characters in the ASCII range (bytes 0 through 127) are encoded identically in both.
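
To make the byte-level difference concrete, here is a minimal sketch using the core Encode module. The helper name hex_bytes is made up for this illustration; it simply hex-dumps a character as encoded in each encoding.

```perl
use strict;
use warnings;
use Encode qw( encode );

# Hypothetical helper: hex dump of $text after encoding it with $encoding.
sub hex_bytes {
    my ($encoding, $text) = @_;
    return uc(unpack("H*", encode($encoding, $text)));
}

# "A" (ASCII), "é" (U+00E9) and "’" (U+2019)
for my $char ("A", "\x{E9}", "\x{2019}") {
    printf("U+%04X  cp1252: %s  UTF-8: %s\n",
        ord($char), hex_bytes("cp1252", $char), hex_bytes("UTF-8", $char));
}
```

Only the ASCII character comes out as the same byte in both encodings; the other two are a single byte in cp1252 and a multi-byte sequence in UTF-8.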

What is Windows-1252 encoding?

Windows-1252 or CP-1252 (code page 1252) is a single-byte character encoding of the Latin alphabet, used by default in the legacy components of Microsoft Windows for English and many European languages including Spanish, French, and German.


2 Answers

Yes!

Obviously, it's better to fix the program creating the file, but that's not always possible. What follows are two solutions.

A line can contain a mix of encodings

Encoding::FixLatin provides a function named fix_latin which decodes text that consists of a mix of UTF-8, iso-8859-1, cp1252 and US-ASCII.

$ perl -e'
   use Encoding::FixLatin qw( fix_latin );
   $bytes = "\xD0 \x92 \xD0\x92\n";
   $text = fix_latin($bytes);
   printf("U+%v04X\n", $text);
'
U+00D0.0020.2019.0020.0412.000A

Heuristics are employed, but they are fairly reliable. Only the following cases will fail:

  • One of
    [ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞß]
    encoded using iso-8859-1 or cp1252, followed by one of
    [€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ<NBSP>¡¢£¤¥¦§¨©ª«¬<SHY>®¯°±²³´µ¶·¸¹º»¼½¾¿]
    encoded using iso-8859-1 or cp1252.

  • One of
    [àáâãäåæçèéêëìíîï]
    encoded using iso-8859-1 or cp1252, followed by two of
    [€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ<NBSP>¡¢£¤¥¦§¨©ª«¬<SHY>®¯°±²³´µ¶·¸¹º»¼½¾¿]
    encoded using iso-8859-1 or cp1252.

  • One of
    [ðñòóôõö÷]
    encoded using iso-8859-1 or cp1252, followed by three of
    [€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ<NBSP>¡¢£¤¥¦§¨©ª«¬<SHY>®¯°±²³´µ¶·¸¹º»¼½¾¿]
    encoded using iso-8859-1 or cp1252.
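
To see why these cases are ambiguous, here is a small sketch using only the core Encode module (it does not call fix_latin itself). Each byte string below is plausible cp1252 text, yet it is also a well-formed UTF-8 character of 2, 3 or 4 bytes, so a decoder that prefers the UTF-8 reading has no way to tell the two apart. The sample strings are my own picks for illustration.

```perl
use strict;
use warnings;
use Encode qw( decode );

# cp1252 text whose bytes also happen to form one valid UTF-8 character.
my @cases = (
    "\xC3\xA9",          # "Ã©"   (2 bytes)
    "\xE2\x80\x99",      # "â€™"  (3 bytes)
    "\xF0\x9F\x98\x80",  # "ðŸ˜€" (4 bytes)
);

for my $bytes (@cases) {
    my $as_cp1252 = decode("cp1252", $bytes);
    # Strict decode: would die if the bytes were not well-formed UTF-8.
    my $as_utf8 = decode("UTF-8", $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    printf("cp1252: U+%v04X   UTF-8: U+%v04X\n", $as_cp1252, $as_utf8);
}
```

None of the strict decodes die, which is exactly the problem: both readings are valid, and only context can say which one was intended.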

The same result can be produced using core module Encode, though I imagine this is a fair bit slower than Encoding::FixLatin with Encoding::FixLatin::XS installed.

$ perl -e'
   use Encode qw( decode_utf8 encode_utf8 decode );
   $bytes = "\xD0 \x92 \xD0\x92\n";
   $text = decode_utf8($bytes, sub { encode_utf8(decode("cp1252", chr($_[0]))) });
   printf("U+%v04X\n", $text);
'
U+00D0.0020.2019.0020.0412.000A

Each line only uses one encoding

fix_latin works on a character level. If it's known that each line is entirely encoded using one of UTF-8, iso-8859-1, cp1252 or US-ASCII, you could make the process even more reliable by checking whether the whole line is valid UTF-8, and falling back to cp1252 only when it is not.

$ perl -e'
   use Encode qw( decode );
   for $bytes ("\xD0 \x92 \xD0\x92\n", "\xD0\x92\n") {
      if (!eval {
         $text = decode("UTF-8", $bytes, Encode::FB_CROAK|Encode::LEAVE_SRC);
         1  # No exception
      }) {
         $text = decode("cp1252", $bytes);
      }

      printf("U+%v04X\n", $text);
   }
'
U+00D0.0020.2019.0020.00D0.2019.000A
U+0412.000A

Heuristics are employed, but they are very reliable. They will only fail if all of the following are true for a given line:

  • The line is encoded using iso-8859-1 or cp1252,

  • At least one of
    [€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ<NBSP>¡¢£¤¥¦§¨©ª«¬<SHY>®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷]
    is present in the line,

  • All instances of
    [ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞß]
    are always followed by exactly one of
    [€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ<NBSP>¡¢£¤¥¦§¨©ª«¬<SHY>®¯°±²³´µ¶·¸¹º»¼½¾¿],

  • All instances of
    [àáâãäåæçèéêëìíîï]
    are always followed by exactly two of
    [€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ<NBSP>¡¢£¤¥¦§¨©ª«¬<SHY>®¯°±²³´µ¶·¸¹º»¼½¾¿],

  • All instances of
    [ðñòóôõö÷]
    are always followed by exactly three of
    [€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ<NBSP>¡¢£¤¥¦§¨©ª«¬<SHY>®¯°±²³´µ¶·¸¹º»¼½¾¿],

  • None of
    [øùúûüýþÿ]
    are present in the line, and

  • None of
    [€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ<NBSP>¡¢£¤¥¦§¨©ª«¬<SHY>®¯°±²³´µ¶·¸¹º»¼½¾¿]
    are present in the line except where previously mentioned.
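
As an illustration of how narrow that failure window is, the sketch below wraps the line-level check in a helper (decode_line is a name I made up for this example) and feeds it two lines that are both meant as cp1252. The first satisfies every condition above and is misread; the second contains one extra "é" (byte 0xE9 followed by a space), which makes the line invalid UTF-8 and so rescues it.

```perl
use strict;
use warnings;
use Encode qw( decode );

# Hypothetical helper: strict UTF-8 first, cp1252 if that fails.
sub decode_line {
    my ($bytes) = @_;
    my $text = eval {
        decode("UTF-8", $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return defined($text) ? $text : decode("cp1252", $bytes);
}

my $ambiguous = "\xC3\xA9";          # cp1252 "Ã©", also valid UTF-8 for "é"
my $rescued   = "caf\xE9 \xC3\xA9";  # cp1252 "café Ã©", not valid UTF-8

printf("U+%v04X\n", decode_line($ambiguous));  # U+00E9 (misread as UTF-8)
printf("U+%v04X\n", decode_line($rescued));    # U+0063.0061.0066.00E9.0020.00C3.00A9
```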


Notes:

  • Encoding::FixLatin installs a command-line tool, fix_latin, to convert files. It would be trivial to write a similar tool using the second approach.
  • fix_latin (both the function and the command-line tool) can be sped up by installing Encoding::FixLatin::XS.
  • The same approach can be used for mixes of UTF-8 with other single-byte encodings. The reliability should be similar, but it can vary.
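
For what it's worth, a command-line filter built on the second approach might look roughly like the sketch below. It is an untuned illustration, not a drop-in replacement for fix_latin: the helper name decode_line and the choice to take file names as arguments are my own assumptions, and it only uses the core Encode module.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw( decode encode );

# Hypothetical helper: decode one line as strict UTF-8, else as cp1252.
sub decode_line {
    my ($bytes) = @_;
    my $text = eval {
        decode("UTF-8", $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return defined($text) ? $text : decode("cp1252", $bytes);
}

# Read each named file as raw bytes and write clean UTF-8 to STDOUT.
binmode(STDOUT);
for my $path (@ARGV) {
    open(my $fh, '<:raw', $path)
        or die("Can't open $path: $!\n");
    while (my $line = <$fh>) {
        print encode("UTF-8", decode_line($line));
    }
    close($fh);
}
```

Saved under a name of your choosing (say fix_lines.pl), it would be run as `perl fix_lines.pl broken.txt >fixed.txt`. Swapping "cp1252" for another single-byte encoding is the only change needed to handle a different mix.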
answered Sep 27 '22 at 23:09 by ikegami


This is one of the reasons I wrote Unicode::UTF8. With Unicode::UTF8 this is trivial using the fallback option in Unicode::UTF8::decode_utf8().

use Unicode::UTF8 qw[decode_utf8];
use Encode        qw[decode];

print "UTF-8 mixed with Latin-1 (ISO-8859-1):\n";
for my $octets ("\xD0 \x92 \xD0\x92\n", "\xD0\x92\n") {
    no warnings 'utf8';
    printf "U+%v04X\n", decode_utf8($octets, sub { $_[0] });
}

print "\nUTF-8 mixed with CP-1252 (Windows-1252):\n";
for my $octets ("\xD0 \x92 \xD0\x92\n", "\xD0\x92\n") {
    no warnings 'utf8';
    printf "U+%v04X\n", decode_utf8($octets, sub { decode('CP-1252', $_[0]) });
}

Output:

UTF-8 mixed with Latin-1 (ISO-8859-1):
U+00D0.0020.0092.0020.0412.000A
U+0412.000A

UTF-8 mixed with CP-1252 (Windows-1252):
U+00D0.0020.2019.0020.0412.000A
U+0412.000A

Unicode::UTF8 is written in C/XS and only invokes the callback/fallback when it encounters an ill-formed UTF-8 sequence.

answered Sep 27 '22 at 23:09 by chansen