
Fix encoding of incoherently encoded text file

I have a long text file that apparently uses different encodings in successive blocks of text (ISO or UTF-8). It is the result of appending text with >> file.bib and of copying and pasting from different sources (web pages).

The blocks can in principle be told apart, since each one is a BibTeX entry:

 @article{key, author={lastname, firstname}, ...}

I would like to convert it to a consistent UTF-8 file, since the mixed encoding seems to crash my BibTeX viewer (KBibTeX). I know that I can use iconv to convert the encoding of an entire file, but I would like to know whether there is a way to fix my file without corrupting some of the entries.

asked May 21 '12 at 14:05 by highsciguy




2 Answers

If you can assume uniform encoding for each line AND you know the alternate encoding:

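#!/usr/bin/perl
use strict;
use warnings;
use Encode;

# Emit everything as UTF-8.
binmode(STDOUT, ':encoding(UTF-8)');

while (my $raw = <>) {
    # Strict UTF-8 first: FB_CROAK makes decode() die on malformed input,
    # and LEAVE_SRC keeps $raw intact for the fallback.
    my $line = eval { decode('UTF-8', $raw, Encode::FB_CROAK | Encode::LEAVE_SRC) };
    # Not valid UTF-8, so treat the line as the alternate encoding.
    $line = decode('iso-8859-1', $raw) unless defined $line;
    # $line now holds Unicode characters; do something with it.
    print $line;
}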

If the lines themselves mix encodings, you can apply the same test word by word, as long as you still know what the alternate encoding is. If you do not know the alternate encoding, or if there is more than one, you need an encoding-guessing library, which may well guess wrong.
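For the word-by-word variant, here is a minimal sketch in Python (used here only for illustration, the approach is the same as in the Perl loop). The function name decode_mixed and the ISO-8859-1 default are assumptions, not something taken from the question. It splits on ASCII whitespace only, so a multi-byte UTF-8 sequence is never cut in half.

```python
import re


def decode_mixed(raw: bytes, fallback: str = "iso-8859-1") -> str:
    """Decode bytes token by token: strict UTF-8 first, then the fallback.

    `fallback` is an assumed alternate encoding; change it if your
    sources used something else.
    """
    pieces = []
    # In a bytes pattern \s matches ASCII whitespace only. The capturing
    # group keeps the separators, so the original spacing is preserved.
    for token in re.split(rb"(\s+)", raw):
        try:
            pieces.append(token.decode("utf-8"))
        except UnicodeDecodeError:
            # Not valid UTF-8, so assume the alternate encoding.
            pieces.append(token.decode(fallback))
    return "".join(pieces)


if __name__ == "__main__":
    import sys

    # Read raw bytes from stdin and write UTF-8 to stdout.
    sys.stdout.buffer.write(decode_mixed(sys.stdin.buffer.read()).encode("utf-8"))
```

A word in the alternate encoding whose bytes happen to form valid UTF-8 would still be misread, so treat the output as a best effort and check the entries with accented characters.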

answered Oct 14 '22 at 07:10 by Alien Life Form


I use vim for this, but I guess it can be done in any editor.

  • Select (shift+v) the block of text whose encoding you want to change.

  • Type :!enca -L lang - (replace 'lang' with your language; I use 'enca -L cs'). The enca utility should then tell you the most probable encoding of the selected block.

  • Press u to undo, since enca's answer has replaced the selected text.

  • Select the block again, this time running :!iconv -f determined_encoding -t UTF-8

Note that vim automatically expands a typed : to :'<,'> when you're in visual mode, which is exactly what you want for running programs on text blocks.
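If the file is too long to fix by hand, the same block-by-block idea can be scripted. The sketch below is only an approximation of the steps above: it does not call enca, but simply tries a fixed list of candidate encodings for each BibTeX entry. The names CANDIDATES, split_entries, guess_and_decode and convert are made up for this example, and the rule "a new block starts at a line beginning with @" is an assumption based on the entry format shown in the question.

```python
from typing import List, Tuple

# Candidate encodings, tried in order (an assumption; adjust as needed).
# ISO-8859-1 accepts any byte sequence, so it acts as the catch-all.
CANDIDATES = ("ascii", "utf-8", "iso-8859-1")


def split_entries(raw: bytes) -> List[bytes]:
    """Split the file into blocks, starting a new one at each '@' line."""
    blocks: List[bytes] = []
    for line in raw.splitlines(keepends=True):
        if not blocks or line.lstrip().startswith(b"@"):
            blocks.append(line)
        else:
            blocks[-1] += line
    return blocks


def guess_and_decode(block: bytes) -> Tuple[str, str]:
    """Return (encoding, text) for the first candidate that decodes cleanly."""
    for name in CANDIDATES:
        try:
            return name, block.decode(name)
        except UnicodeDecodeError:
            continue
    raise ValueError("no candidate encoding matched")


def convert(raw: bytes) -> bytes:
    """Re-encode every block as UTF-8 and join them back together."""
    return "".join(guess_and_decode(b)[1] for b in split_entries(raw)).encode("utf-8")
```

Printing the encoding name returned by guess_and_decode for each block gives a rough report of which entries were not UTF-8, which is worth reviewing before overwriting the original file.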

answered Oct 14 '22 at 06:10 by exa