
How to tell the original encoding of a file

Tags: java, encoding

I have a bunch of plain text files which I downloaded from 3rd party servers. Some of them are gibberish: the server declared the encoding as ENCODING1 (e.g. UTF-8), but in reality the file was encoded in ENCODING2 (e.g. Windows-1252).

Is there a way to somehow correct these files?

I presume the declared encoding (ENCODING1) was mostly one of UTF-8, ISO-8859-2 and Windows-1252, and that the files were actually saved in one of these encodings as well. I was thinking about re-encoding the content of every file with

new String(content.getBytes(ENCODING1), ENCODING2)

with all possible combinations of ENCODING1 and ENCODING2 (for 3 encodings that would be 9 options), then finding some way (for example, character frequency?) to tell which of the 9 results is the correct one.
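The brute-force idea above could be sketched roughly as follows, using only the JDK. This is a minimal sketch, not a tested solution: the class and method names (`EncodingRepair`, `recode`, `nonAscii`, `repair`) are hypothetical, and the score (fewest non-ASCII characters wins) is a crude stand-in for a real character-frequency model. It also assumes the JDK in use ships the ISO-8859-2 and windows-1252 charsets. Strict encoders and decoders are used so that combinations which cannot round-trip the text are discarded instead of silently producing `?` or U+FFFD.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.List;

public class EncodingRepair {
    // The three encodings named in the question (assumed to be available in this JDK).
    static final List<Charset> CANDIDATES = List.of(
            StandardCharsets.UTF_8,
            Charset.forName("ISO-8859-2"),
            Charset.forName("windows-1252"));

    // Undo a decode with 'wrong' and redo it with 'actual'.
    // Returns null if the text cannot be converted losslessly.
    static String recode(String text, Charset wrong, Charset actual) {
        try {
            ByteBuffer bytes = wrong.newEncoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .encode(CharBuffer.wrap(text));
            return actual.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(bytes)
                    .toString();
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    // Hypothetical heuristic: mis-decoded multi-byte text tends to contain
    // more non-ASCII characters than the correctly decoded text.
    static long nonAscii(String s) {
        return s.chars().filter(c -> c > 127).count();
    }

    // Try all 9 combinations; keep the input unless a candidate scores strictly better.
    static String repair(String garbled) {
        String best = garbled;
        long bestScore = nonAscii(garbled);
        for (Charset wrong : CANDIDATES) {
            for (Charset actual : CANDIDATES) {
                String candidate = recode(garbled, wrong, actual);
                if (candidate != null && nonAscii(candidate) < bestScore) {
                    best = candidate;
                    bestScore = nonAscii(candidate);
                }
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // UTF-8 bytes of "caf\u00E9" that were wrongly decoded as windows-1252.
        System.out.println(repair("caf\u00C3\u00A9"));
    }
}
```

Note that this score cannot tell ISO-8859-2 and Windows-1252 apart, because both are single-byte encodings and produce the same number of non-ASCII characters; for that case a language-specific frequency table or dictionary check would be needed.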

Are there any 3rd party libraries for this?

I tried JChardet and ICU4J, but as far as I know both of them are only capable of detecting the encoding of the file before the step with ENCODING1 took place.

Thanks, krisy

krisy asked Nov 02 '22 13:11


1 Answer

You can use juniversalchardet, a Java port of Mozilla's universal charset detector hosted on Google Code, to detect the character set of a file. See the following:

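import java.io.FileInputStream;
import java.io.IOException;

import org.mozilla.universalchardet.UniversalDetector;

public class TestDetector
{
    public static void main(String[] args) throws IOException
    {
        if (args.length != 1) {
            System.err.println("Usage: java TestDetector FILENAME");
            System.exit(1);
        }

        byte[] buf = new byte[4096];
        String fileName = args[0];

        // (1) Construct the detector; null means no listener is registered.
        UniversalDetector detector = new UniversalDetector(null);

        // (2) Feed the file to the detector in chunks until it is done or the file ends.
        //     try-with-resources closes the stream even if reading fails.
        try (FileInputStream fis = new FileInputStream(fileName)) {
            int nread;
            while ((nread = fis.read(buf)) > 0 && !detector.isDone()) {
                detector.handleData(buf, 0, nread);
            }
        }

        // (3) Tell the detector that no more data is coming.
        detector.dataEnd();

        // (4) Read the result; it is null when no charset could be determined.
        String encoding = detector.getDetectedCharset();
        if (encoding != null) {
            System.out.println("Detected encoding = " + encoding);
        } else {
            System.out.println("No encoding detected.");
        }

        // (5) Reset the detector so it can be reused for another file.
        detector.reset();
    }
}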

Read more at the following URL

You can also try jChardet, hosted on SourceForge; see the following URL
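Once a detector has reported a charset name, the file can be rewritten in a single known encoding. The following is a minimal sketch using only the JDK; `ConvertToUtf8` and `convert` are hypothetical names, and it assumes the reported name is one that `Charset.forName` recognizes on your JVM (this is not guaranteed for every name a detector may return).

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ConvertToUtf8 {
    // Reads 'source' as text in 'sourceCharset' and writes it to 'target' as UTF-8.
    // The whole file is held in memory, which is fine for small text files.
    static void convert(Path source, Charset sourceCharset, Path target) throws IOException {
        String text = new String(Files.readAllBytes(source), sourceCharset);
        Files.write(target, text.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 3) {
            System.err.println("Usage: java ConvertToUtf8 SOURCE CHARSET TARGET");
            System.exit(1);
        }
        // args[1] would be the detected charset name, e.g. "windows-1252".
        convert(Paths.get(args[0]), Charset.forName(args[1]), Paths.get(args[2]));
    }
}
```

Writing to a separate target file keeps the original bytes intact in case the detection turns out to be wrong.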

Cheers !!

Sachin Thapa answered Nov 08 '22 05:11