
Questions about converting a mixed-encoding file to UTF-8 in Perl

I'm in the process of converting files generated by the ancient DOS-based library program of our university's Chinese Studies Department into something more useful and accessible.

Among the problems I'm dealing with is that the exported text files (about 80MB in size) are in mixed encoding. I'm on Windows.

German umlauts and other higher-ASCII characters are encoded in cp1252, I think, and the CJK characters in GB18030. Due to the "overlapping" encodings, I can't just drag the whole file into Word or something and let it do the conversion, because I would get something like this:

orig:

+Autor:
-Yan, Lianke / ÑÖÁ¬¿Æ      # encoded Chinese characters
+Co-Autor:
-Min, Jie / (šbers.)       # encoded German U-umlaut (Ü)

result:

+Autor:
-Yan, Lianke / 阎连科       # good
+Co-Autor:
-Min, Jie / (歜ers.)       # bad... (should be: "Übers.")

So I wrote a script with several subroutines that converts non-ASCII characters in several steps. It does the following things (among others):

  1. replace some higher-order ASCII characters (š, á, etc.) with alphanumeric codes (unlikely to naturally appear anywhere else in the file). Ex.: -Min, Jie / (šbers.) -> -Min, Jie / (uumlautgrossbers.)
    Note: I did the "conversion table" by hand, so I only took the special characters actually appearing in my document into consideration. The conversion is thus not fully complete, but it yields adequate results in my case, as our books are mostly in German, English and Chinese, with only very few in languages such as Italian, Spanish or French, and almost none in Czech and the like.

  2. replace á, £, ¢, ¡, í with alphanumeric codes only if they are not preceded or followed by another character in the high ASCII range \x80-\xFF. (These are the cp1252-encoded versions of ß, ú, ó, í and ø, the "small Nordic o with cross-stroke", and they appear both in cp1252- and GB18030-encoded strings.)

  3. read the whole file in and convert it from GB18030 to UTF-8, thus converting the encoded Chinese characters into real Chinese characters.

  4. Convert the alphanumeric codes back to their Unicode equivalents.

Although the script mostly works, the following problem arises:

  • After converting the original 80MB file, Notepad++ still thinks it is an ANSI file and displays it as such. I need to press "Encoding->Encode in UTF-8" in order to display it correctly.

What I'd like to know is:

  1. Generally, is there a better approach to convert a mixed-encoding file into UTF-8?

  2. If not, should I add use utf8 so that I can enter the characters directly instead of their hex representations in the codes2char subroutine? (See the sketch after this list.)

  3. Would a BOM at the beginning of the file solve the problem of NP++ displaying it initially as an ANSI file? If so, how should I modify my script so that the output file has a BOM?

  4. After the conversion I may want to call some more subroutines (e.g. to convert the whole file to CSV or ODS format). Do I need to continue using the opening statement from the codes2char subroutine?
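
Regarding question 2, here is a minimal sketch of what that would look like in practice. It assumes the script file itself is saved as UTF-8 and just illustrates the idea with two of the codes from above, so it is not a drop-in replacement for the real subroutine:

#!/usr/bin/perl
# Sketch for question 2: with "use utf8" the replacement characters can be
# typed literally instead of as \xF6, \xDF, ... (assumes this file is UTF-8).
use strict;
use warnings;
use utf8;
use open qw( :std :encoding(UTF-8) );   # decode STDIN / encode STDOUT as UTF-8

while (my $line = <STDIN>) {
    $line =~ s/oumlautklein/ö/g;        # instead of s#oumlautklein#\xF6#g
    $line =~ s/eszett/ß/g;              # instead of s#eszett#\xDF#g
    print $line;
}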

The code is composed of several subroutines which are called at the end:

#!perl -w
use strict; 
use warnings;
use Encode qw(decode encode); 
use Encode::HanExtra;

our $input = "export.txt";
our $output = "export2.txt";

sub switch_var {                # switch Input and Output file between steps
    ($input, $output) = ($output, $input);
}

sub specialchars2codes {
open our $in,  '<', $input  or die "$input: $!\n";
open our $out, '>', $output or die "$output: $!\n";

while( <$in> )  {   
    ## replace higher ASCII characters such as a-umlaut etc. with codes.
    s#\x94#oumlautklein#g;
    s#\x84#aumlautklein#g;
    s#\x81#uumlautklein#g;
    ## ... and some more. (ö, Ö, ä, Ä, Ü, ü, ê, è, é, É, â, á, à, ì, î, 
    ## û, ù, ô, ò, ç, ï, a°, e-umlaut and ñ in total.)

    ## replace problematic special characters (ß, ú, ó, í, ø, ') with codes.
    s#(?<![\x80-\xFF])\xE1(?![\x80-\xFF])#eszett#g;
    s#(?<![\x80-\xFF])\xA3(?![\x80-\xFF])#uaccentaiguklein#g;
    s#(?<![\x80-\xFF])\xA2(?![\x80-\xFF])#oaccentaiguklein#g;
    s#(?<![\x80-\xFF])\xA1(?![\x80-\xFF])#iaccentaiguklein#g;
    s#(?<![\x80-\xFF])\xED(?![\x80-\xFF])#nordischesoklein#g;

    print $out $_;
    }   
close $out;
close $in;
}

sub convert2unicode {

open(our $in,  "< :encoding(GB18030)", $input)  or die "$!\n";
open(our $out, "> :encoding(UTF-8)",  $output)  or die "$!\n";

print "Convert ASCII to UTF-8\n\n";

while (<$in>) {         
        print $out $_;      
}

close $in;
close $out;
}

sub codes2char {

open(our $in,  "< :encoding(UTF-8)", $input)    or die "$!\n";
open(our $out, "> :encoding(UTF-8)", $output)   or die "$!\n";

print "replace Codes with original characters.\n";


    while (<$in>) {
        ## code names must match the ones written by specialchars2codes
        s#oumlautklein#\xF6#g;
        s#aumlautklein#\xE4#g;
        s#uumlautklein#\xFC#g;
        ## ... and some more.
        s#eszett#\xDF#g;
        s#uaccentaiguklein#\xFA#g;
        s#oaccentaiguklein#\xF3#g;
        s#iaccentaiguklein#\xED#g;
        s#nordischesoklein#\xF8#g;

        print  $out $_;
    }
close($in)   or die "can't close $input: $!";
close($out)  or die "can't close $output: $!";
}

##################
## Main program ##
##################

&specialchars2codes;
&switch_var;
&convert2unicode;
&switch_var;
&codes2char;

Wow, this was long. I hope it's not too convoluted.

EDIT:

This is a hexdump of the example string above:

01A36596                                                        2B 41                    +A
01A365A9   75 74 6F 72 3A 0D 0A 2D  59 61 6E 2C 20 4C 69 61  6E 6B 65   utor:  -Yan, Lianke
01A365BC   20 2F 20 D1 D6 C1 AC BF  C6 0D 0A 2B 43 6F 2D 41  75 74 6F    / ÑÖÁ¬¿Æ  +Co-Auto
01A365CF   72 3A 0D 0A 2D 4D 69 6E  2C 20 4A 69 65 20 2F 20  28 9A 62   r:  -Min, Jie / (šb
01A365E2   65 72 73 2E 29 0D 0A                                         ers.)  

and another two to illustrate:

1.

000036B3                                                     2D 52 75                   -Ru
000036C6   E1 6C 61 6E 64 0D 0A                                         áland  

2.

015FE030            2B 54 69 74 65  6C 3A 0D 0A 2D 57 65 6E  72 6F 75      +Titel:  -Wenrou
015FE043   64 75 6E 68 6F 75 20 20  CE C2 C8 E1 B6 D8 BA F1  20 28 47   dunhou  ÎÂÈá¶Øºñ (G
015FE056   65 6E 74 6C 65 6E 65 73  73 20 61 6E 64 20 4B 69  6E 64 6E   entleness and Kindn
015FE069   65 73 73 29 2E 0D 0A                                         ess).  

In both cases, the hex value E1 appears. In the first, it stands for a German sharp s (ß, "Rußland" = "Russia"); in the second, it is part of the multi-byte CJK character 柔 (reading "rou").

In the library program, the Chinese characters are entered and displayed with an additional program which has to be loaded first and which, as far as I can tell, is hooked into the graphics driver at a low level, catching encoded Chinese characters and displaying them as characters while leaving everything else alone. The German umlauts etc. are handled by the library program itself.

I don't fully understand how this works, i.e. how the programs know whether hex E1 is to be treated as a single character á (and thus converted according to codepage X) or as part of a multi-byte character (and thus converted according to codepage Y).

The closest approximation I have found is that a special character is likely to be part of a Chinese string if there are other special characters before or after it (e.g. ÎÂÈá¶Øºñ).
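
To make that heuristic concrete, here is a rough sketch of how it could be applied directly to the raw bytes. The single-byte codepage ('cp850') is a guess, and any Chinese character whose second byte happens to fall below \x80 would slip through, so treat it as an approximation only:

#!/usr/bin/perl
# Sketch only: runs of two or more high bytes are decoded as GB18030, isolated
# high bytes via a single-byte DOS codepage. 'cp850' is a guess and may need
# to be replaced by the hand-made conversion table.
use strict;
use warnings;
use Encode qw(decode);
use Encode::HanExtra;    # adds GB18030 support to Encode

open my $in,  '<:raw',             'export.txt'  or die "export.txt: $!";
open my $out, '>:encoding(UTF-8)', 'export2.txt' or die "export2.txt: $!";

while (my $line = <$in>) {
    $line =~ s{ ( [\x80-\xFF]{2,} ) | ( [\x80-\xFF] ) }
              { defined $1 ? decode('GB18030', $1) : decode('cp850', $2) }egx;
    print {$out} $line;
}
close $in;
close $out;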

asked Aug 01 '11 by screen12345


1 Answer

  1. If the mixed encoding is such that each line/record/field/whatever is in a consistent encoding, you can read and convert each line/record/field/whatever individually. But that doesn't sound like the case here.
  2. Wouldn't be a bad idea.
  3. UTF-8 doesn't normally use a BOM, although if you really want to try it, output the character U+FEFF (in UTF-8, that's the 3 bytes EF BB BF); a short sketch follows this list. It would be better if you can figure out why exactly NP++ is misdetecting the file.
  4. When reading a UTF-8-encoded file, opening it with the UTF-8 input layer is a good idea. If you want, <:utf8 is a shorter equivalent to < :encoding(UTF-8).
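
For point 3, a hedged sketch of how the BOM could be written from Perl (the output file name is taken from the question's $output and is only an example):

use strict;
use warnings;

# Write U+FEFF as the very first character; the :encoding(UTF-8) layer turns it
# into the bytes EF BB BF, which Notepad++ recognizes as a UTF-8 BOM.
open my $out, '>:encoding(UTF-8)', 'export2.txt' or die "export2.txt: $!";
print {$out} "\x{FEFF}";
# ... print the converted lines here ...
close $out;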

As for how the original mess works, it seems that the "additional program" just converts anything that looks to it like a Chinese character into Chinese and leaves anything else alone (which the standard drivers then display using a European encoding), while the "library program" just outputs whatever codes it receives. So a more straightforward way to convert your file might be to mirror this: read in the file using :encoding(latin-1) (or whatever) and then replace the Chinese characters (e.g. s/\xc8\xe1/柔/).
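
A minimal sketch of that suggestion, assuming a hand-made lookup table of the Chinese byte sequences that actually occur in the export (the table below is a one-entry placeholder and the file names are only examples):

#!/usr/bin/perl
# Sketch of the latin-1-plus-lookup-table idea described above; the table shown
# is a placeholder, not a complete mapping.
use strict;
use warnings;
use utf8;    # allow literal Chinese characters in this source file

# One entry per encoded Chinese character that actually occurs in the export.
my %cjk = (
    "\x{C8}\x{E1}" => '柔',
    # ...
);
my $cjk_pat = join '|', map { quotemeta } sort keys %cjk;

open my $in,  '<:encoding(latin-1)', 'export.txt'  or die "export.txt: $!";
open my $out, '>:encoding(UTF-8)',   'export2.txt' or die "export2.txt: $!";

while (my $line = <$in>) {
    $line =~ s/($cjk_pat)/$cjk{$1}/g;   # map known sequences to real characters
    print {$out} $line;
}
close $in;
close $out;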

answered Oct 19 '22 by Anomie