How can I identify the encoding of a file that has no BOM and begins with non-ASCII characters?

I have a problem identifying the encoding of a file that has no BOM, particularly when the file begins with non-ASCII characters.

I found the following two topics about identifying file encodings:

  • How can I identify different encodings without the use of a BOM?

  • Java: Readers and Encodings

Currently, I have a class that identifies the encoding of a file (e.g. UTF-8, UTF-16, UTF-32, UTF-16 without BOM, etc.), as follows:

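import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;

public class UnicodeReader extends Reader {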
private static final int BOM_SIZE = 4;
private final InputStreamReader reader;

/**
 * Construct UnicodeReader
 * @param in Input stream.
 * @param defaultEncoding Default encoding to be used if BOM is not found,
 * or <code>null</code> to use system default encoding.
 * @throws IOException If an I/O error occurs.
 */
public UnicodeReader(InputStream in, String defaultEncoding) throws IOException {
    byte bom[] = new byte[BOM_SIZE];
    String encoding;
    int unread;
    PushbackInputStream pushbackStream = new PushbackInputStream(in, BOM_SIZE);
    int n = pushbackStream.read(bom, 0, bom.length);

    // Read ahead four bytes and check for BOM marks. Only the first n bytes
    // of bom are valid. UTF-32 is tested before UTF-16 because the UTF-32LE
    // BOM (FF FE 00 00) begins with the UTF-16LE BOM (FF FE).
    if ((n >= 3) && (bom[0] == (byte) 0xEF) && (bom[1] == (byte) 0xBB) && (bom[2] == (byte) 0xBF)) {
        encoding = "UTF-8";
        unread = n - 3;
    } else if ((n >= 4) && (bom[0] == (byte) 0x00) && (bom[1] == (byte) 0x00) && (bom[2] == (byte) 0xFE) && (bom[3] == (byte) 0xFF)) {
        encoding = "UTF-32BE";
        unread = n - 4;
    } else if ((n >= 4) && (bom[0] == (byte) 0xFF) && (bom[1] == (byte) 0xFE) && (bom[2] == (byte) 0x00) && (bom[3] == (byte) 0x00)) {
        encoding = "UTF-32LE";
        unread = n - 4;
    } else if ((n >= 2) && (bom[0] == (byte) 0xFE) && (bom[1] == (byte) 0xFF)) {
        encoding = "UTF-16BE";
        unread = n - 2;
    } else if ((n >= 2) && (bom[0] == (byte) 0xFF) && (bom[1] == (byte) 0xFE)) {
        encoding = "UTF-16LE";
        unread = n - 2;
    } else {
        // No BOM detected but still could be UTF-16: count the zero bytes
        // among the bytes that were actually read.
        int found = 0;
        for (int i = 0; i < n; i++) {
            if (bom[i] == (byte) 0x00)
                found++;
        }

        if (found >= 2) {
            // A leading zero byte suggests big-endian, otherwise little-endian.
            if (bom[0] == (byte) 0x00) {
                encoding = "UTF-16BE";
            } else {
                encoding = "UTF-16LE";
            }
            unread = n;
        } else {
            encoding = defaultEncoding;
            unread = n;
        }
    }

    // Push back the bytes that are not part of a BOM so the reader sees them.
    if (unread > 0) {
        pushbackStream.unread(bom, (n - unread), unread);
    }

    // Use given encoding.
    if (encoding == null) {
        reader = new InputStreamReader(pushbackStream);
    } else {
        reader = new InputStreamReader(pushbackStream, encoding);
    }
}

public String getEncoding() {
    return reader.getEncoding();
}

public int read(char[] cbuf, int off, int len) throws IOException {
    return reader.read(cbuf, off, len);
}

public void close() throws IOException {
    reader.close();
}

}

The above code works in all cases except when the file has no BOM and begins with non-ASCII characters. In that case the check for BOM-less UTF-16 does not work: it looks for 0x00 bytes among the first four bytes, and most characters above U+00FF (CJK ideographs, for example) contain no 0x00 byte in UTF-16, so the encoding falls back to the default, UTF-8.

Is there a way to detect the encoding of a file that has no BOM and begins with non-ASCII characters, especially a UTF-16 file without a BOM?

Thanks; any ideas would be appreciated.

Eason asked Nov 04 '22 23:11


2 Answers

Generally speaking, there is no way to know the encoding for sure if it is not provided.

You may guess UTF-8 from the characteristic bit patterns of its multi-byte sequences (a lead byte starting with 110, 1110, or 11110, followed by continuation bytes that each start with 10), but it is still a guess.
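
As a rough illustration of that UTF-8 check, the sketch below lets the JDK's strict decoder validate the byte patterns instead of testing the bits by hand. The class and method names (`Utf8Check`, `isValidUtf8`) are made up for this example. A `true` result only means the bytes are well-formed UTF-8, not that the file was actually written as UTF-8.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the UnicodeReader class in the question.
class Utf8Check {

    /** Returns true if the bytes form a well-formed UTF-8 sequence. */
    static boolean isValidUtf8(byte[] bytes) {
        try {
            // REPORT makes the decoder throw instead of silently
            // substituting U+FFFD for malformed input.
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The same two CJK characters, encoded two different ways.
        byte[] asUtf8 = "\u4E2D\u6587".getBytes(StandardCharsets.UTF_8);
        byte[] asUtf16 = "\u4E2D\u6587".getBytes(StandardCharsets.UTF_16BE);
        System.out.println(isValidUtf8(asUtf8));
        System.out.println(isValidUtf8(asUtf16));
    }
}
```

If only a prefix of the file is sampled, the sample may end in the middle of a multi-byte sequence and be rejected for that reason alone, so a real implementation would need to tolerate a truncated final character.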

UTF-16 is a hard one: the same byte stream can usually be decoded as both BE and LE, and either way it produces some characters (though potentially meaningless text).
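
To make the UTF-16 ambiguity concrete, here is a small sketch that decodes the same four bytes in both byte orders. The class name `Utf16Ambiguity` and the sample bytes are chosen for illustration only; the bytes are the UTF-16BE encoding of U+4E2D U+6587.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: decodes one byte array in both UTF-16 byte orders.
class Utf16Ambiguity {

    static String decode(byte[] bytes, boolean bigEndian) {
        return new String(bytes,
                bigEndian ? StandardCharsets.UTF_16BE : StandardCharsets.UTF_16LE);
    }

    public static void main(String[] args) {
        // No BOM and no zero bytes, so nothing in the data favors either order.
        byte[] bytes = {0x4E, 0x2D, 0x65, (byte) 0x87};
        for (boolean bigEndian : new boolean[] {true, false}) {
            StringBuilder codePoints = new StringBuilder();
            for (char c : decode(bytes, bigEndian).toCharArray()) {
                codePoints.append(String.format("U+%04X ", (int) c));
            }
            System.out.println((bigEndian ? "BE: " : "LE: ") + codePoints);
        }
    }
}
```

Both calls succeed without any decoding error. Only a reader who knows what the text should say can tell which result is the intended one.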

Some code out there uses statistical analysis to guess the encoding from the frequency of symbols, but that requires some assumptions about the text (e.g. "this is a Mongolian text") and frequency tables (which may not match the text). At the end of the day this remains just a guess, and cannot help in 100% of cases.
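
A much cruder statistic than symbol frequencies, shown here only as a sketch, is to count zero bytes at even and odd offsets over a larger sample rather than over the first four bytes. It assumes the text contains at least some ASCII-range characters (spaces, digits, line breaks) somewhere in the sample. The name `guessUtf16` and both thresholds are arbitrary choices for this example, not values taken from any library.

```java
// Hypothetical heuristic: guesses the byte order of BOM-less UTF-16 text.
class Utf16Heuristic {

    /**
     * Returns "UTF-16BE", "UTF-16LE", or null when the sample gives no
     * clear signal (for example, text with no ASCII-range characters).
     */
    static String guessUtf16(byte[] sample) {
        int usable = sample.length - (sample.length % 2); // whole code units only
        int evenZeros = 0;
        int oddZeros = 0;
        for (int i = 0; i < usable; i++) {
            if (sample[i] == 0) {
                if (i % 2 == 0) {
                    evenZeros++; // high byte of a big-endian ASCII-range unit
                } else {
                    oddZeros++;  // high byte of a little-endian ASCII-range unit
                }
            }
        }
        int units = usable / 2;
        if (units == 0) {
            return null;
        }
        // Assumed thresholds: zero bytes in at least 5% of the code units,
        // and at least four times more common on one side than the other.
        int minZeros = Math.max(1, units / 20);
        if (evenZeros >= minZeros && evenZeros > 4 * oddZeros) {
            return "UTF-16BE";
        }
        if (oddZeros >= minZeros && oddZeros > 4 * evenZeros) {
            return "UTF-16LE";
        }
        return null;
    }
}
```

For text such as "\u4E2D\u6587 test 123\n" the leading ideographs contribute no zero bytes, but the ASCII characters after them do, so the byte order can still be guessed. For text made up entirely of characters above U+00FF the method returns null, which is the case no heuristic of this kind can settle.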

Vladimir Dyuzhev answered Nov 14 '22 00:11


The best approach is not to try to implement this yourself. Instead, use an existing library; see "Java: How to determine the correct charset encoding of a stream". For instance:

  • http://code.google.com/p/juniversalchardet/
  • http://jchardet.sourceforge.net/
  • http://site.icu-project.org/
  • http://glaforge.free.fr/wiki/index.php?wiki=GuessEncoding
  • http://docs.codehaus.org/display/GUESSENC/Home

It should be noted that the best that can be done is to guess at the most likely encoding for the file. In the general case, it is impossible to be 100% sure that you've figured out the correct encoding; i.e. the encoding that was used when creating the file.


I would say these third-party libraries also cannot identify the encoding of the file I encountered [...] they could be improved to meet my requirement.

Alternatively, you could recognize that your requirement is exceedingly hard to meet ... and change it; e.g.

  • restrict yourself to a certain set of encodings,
  • insist that the person who provides / uploads the file correctly state what its encoding (or primary language) is, and/or
  • accept that your system is going to get it wrong a certain percent of the time, and provide the means whereby someone can correct incorrectly stated / guessed encodings.
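
The first two options can be sketched roughly as follows, assuming a hypothetical `resolve` method that trusts the encoding declared by the uploader when the bytes decode cleanly under it, and otherwise tries a fixed, ordered list of candidates. All names here are made up for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: picks an encoding from a restricted candidate set.
class CandidateEncodings {

    /** Returns the declared charset if it fits, else the first fitting candidate, else null. */
    static Charset resolve(byte[] bytes, Charset declared, List<Charset> candidates) {
        if (declared != null && decodes(bytes, declared)) {
            return declared;
        }
        for (Charset candidate : candidates) {
            if (decodes(bytes, candidate)) {
                return candidate;
            }
        }
        return null;
    }

    /** True if the bytes decode under the charset without any reported error. */
    static boolean decodes(byte[] bytes, Charset charset) {
        try {
            charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Strict encodings go first; ISO-8859-1 accepts any byte sequence,
        // so it only makes sense as the last resort.
        List<Charset> candidates =
                Arrays.asList(StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        byte[] latin1 = {(byte) 0xE9}; // not valid UTF-8 on its own
        System.out.println(resolve(latin1, null, candidates));
    }
}
```

The order of the candidate list matters, and a successful decode is still only a guess, so the third option (letting someone correct a wrong result) remains necessary. A correction can be applied by calling `resolve` again with the corrected charset as `declared`.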

Face the facts: this is a THEORETICALLY UNSOLVABLE problem.

Stephen C answered Nov 14 '22 00:11