 

Refactoring auto-detection of a file's encoding

I need to detect the encoding of files. This code works, but it's a bit long. How can I refactor this logic? Is there perhaps another approach for this task?

Code:

import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;

class CharsetDetector implements Checker {

    Charset detectCharset(File currentFile, String[] charsets) {
        Charset charset = null;

        for (String charsetName : charsets) {
            charset = detectCharset(currentFile, Charset.forName(charsetName));
            if (charset != null) {
                break;
            }
        }

        return charset;
    }

    private Charset detectCharset(File currentFile, Charset charset) {
        // try-with-resources closes the stream even if decoding fails
        try (BufferedInputStream input = new BufferedInputStream(
                new FileInputStream(currentFile))) {

            CharsetDecoder decoder = charset.newDecoder();
            decoder.reset();

            byte[] buffer = new byte[512];
            int bytesRead;
            boolean identified = false;
            // Decode only the bytes actually read; reusing the whole buffer
            // would re-decode stale data from a previous pass. Note that a
            // multi-byte sequence split across two chunks can still be
            // falsely rejected by this chunked approach.
            while (((bytesRead = input.read(buffer)) != -1) && (!identified)) {
                identified = identify(buffer, bytesRead, decoder);
            }

            return identified ? charset : null;
        } catch (IOException e) {
            return null;
        }
    }

    private boolean identify(byte[] bytes, int length, CharsetDecoder decoder) {
        try {
            decoder.decode(ByteBuffer.wrap(bytes, 0, length));
        } catch (CharacterCodingException e) {
            return false;
        }
        return true;
    }

    @Override
    public boolean check(File fileToCheck) {
        return charsetDetector(fileToCheck);
    }

    private boolean charsetDetector(File currentFile) {
        String[] charsetsToBeTested = { "UTF-8", "windows-1253", "ISO-8859-7" };

        // no need to construct a second CharsetDetector; use our own method
        Charset charset = detectCharset(currentFile, charsetsToBeTested);

        if (charset == null) {
            System.out.println("Unrecognized charset.");
            return false;
        }

        // sanity check: the file can actually be read with the detected charset
        try (InputStreamReader reader = new InputStreamReader(
                new FileInputStream(currentFile), charset)) {
            reader.read();
        } catch (FileNotFoundException exc) {
            System.out.println("File not found!");
            exc.printStackTrace();
            return false;
        } catch (IOException exc) {
            exc.printStackTrace();
            return false;
        }

        return true;
    }
}

Question:

  • How can this program's logic be refactored?
  • What are other ways to detect an encoding (e.g., UTF-16 sequences)?
asked Mar 01 '13 by catch23

2 Answers

The best way to refactor this code would be to bring in a third-party library that does character detection for you: such libraries probably do it better than hand-rolled code, and it would make your code smaller. See this question for a few alternatives.
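For example, here is a minimal sketch using the juniversalchardet library, a Java port of Mozilla's universal charset detector. The API shown (UniversalDetector with handleData/dataEnd/getDetectedCharset) follows that library's documentation; treat the details as an assumption rather than a drop-in guarantee:

import org.mozilla.universalchardet.UniversalDetector;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

class LibraryBasedDetection {
    // Returns the detected charset name, or null if the detector can't decide.
    static String detectCharset(File file) throws IOException {
        UniversalDetector detector = new UniversalDetector(null);
        try (FileInputStream fis = new FileInputStream(file)) {
            byte[] buf = new byte[4096];
            int nread;
            // feed data until the detector is confident or we hit EOF
            while ((nread = fis.read(buf)) > 0 && !detector.isDone()) {
                detector.handleData(buf, 0, nread);
            }
        }
        detector.dataEnd(); // flush the detector's internal state
        return detector.getDetectedCharset();
    }
}

The result is a charset name as a String (or null when detection fails), which can then be passed to Charset.forName.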

answered Oct 11 '22 by radai

As has been pointed out, you can't "know" or "detect" the encoding of a file. Complete accuracy requires that you be told, as there is almost always a byte sequence which is ambiguous with respect to several character encodings.

You'll find some more discussion about detecting UTF-8 vs ISO8859-1 in this SO question. The essential answer is to check each byte sequence in the file to verify its compatibility with the expected encoding. For UTF-8 byte encoding rules, see http://en.wikipedia.org/wiki/UTF-8.
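In Java, one way to express that check is a CharsetDecoder configured to REPORT malformed input rather than silently replace it. A minimal sketch follows (class and method names are mine); it decodes the whole file in one shot, which avoids the pitfall of a multi-byte sequence being split across buffer boundaries during chunked decoding:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.file.Files;
import java.nio.file.Path;

class StrictDecodeCheck {
    // Returns true iff every byte sequence in the file decodes cleanly
    // under the given charset. REPORT makes the decoder throw instead of
    // substituting replacement characters.
    static boolean isCompatible(Path file, Charset charset) throws IOException {
        byte[] bytes = Files.readAllBytes(file);
        CharsetDecoder decoder = charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes)); // throws on invalid input
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}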

In particular, there's a very interesting paper on detecting character encodings/sets: http://www-archive.mozilla.org/projects/intl/UniversalCharsetDetection.html. They claim extremely high accuracy, for what are, in the end, guesses! The price is a very complex detection system, complete with knowledge about character frequencies in different languages, that doesn't fit in the 30 lines the OP has hinted at as the right code size. Apparently the detection algorithm is built into Mozilla, so you can likely find and extract it.

We settled for a much simpler scheme: a) believe what you are told the character set is, if you are told; b) if not, check for a BOM and believe what it says if one is present; otherwise, sniff for pure 7-bit ASCII, then UTF-8, then ISO-8859, in that order. You can build an ugly routine that does this in one pass over the file. A sketch of the BOM step follows.
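A minimal Java sketch of the BOM check (the byte signatures are the standard Unicode BOMs; the method name is my own):

// Sketch: map a leading byte-order mark to a charset name, or null if
// no BOM is present. Signatures are the standard Unicode BOMs.
static String charsetFromBOM(byte[] b) {
    if (b.length >= 3 && (b[0] & 0xFF) == 0xEF
            && (b[1] & 0xFF) == 0xBB && (b[2] & 0xFF) == 0xBF)
        return "UTF-8";
    if (b.length >= 4 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE
            && b[2] == 0 && b[3] == 0)
        return "UTF-32LE";
    if (b.length >= 4 && b[0] == 0 && b[1] == 0
            && (b[2] & 0xFF) == 0xFE && (b[3] & 0xFF) == 0xFF)
        return "UTF-32BE";
    if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE)
        return "UTF-16LE";
    if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF)
        return "UTF-16BE";
    return null; // no BOM: fall back to sniffing ASCII, then UTF-8, then ISO-8859
}

Order matters here: the UTF-32LE signature begins with the UTF-16LE one, so it must be tested first.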

(I think the problem is going to get worse over time. Unicode has a new revision every year, with truly subtle differences in valid code points. To do that right, you need to check every code point for validity. If we're lucky, they're all backwards compatible.)

[EDIT: OP seems to be having trouble coding this in Java. Our solution and the sketch on the other page are not coded in Java, so I can't copy and paste an answer directly. I'm going to draft a Java version here based on his code; it isn't compiled or tested. YMMV]

static int UTF8size(byte[] buffer, int buf_index)
// Java-version of character-sniffing test on other page
// This only checks for UTF8 compatible bit-pattern layout
// A tighter test (what we actually did) would check for valid UTF-8 code points
{   int first_character = buffer[buf_index] & 0xFF; // mask: treat the byte as unsigned

    // This first character test might be faster as a switch statement
    if ((first_character & 0x80) == 0) return 1; // ASCII subset character, fast path
    else if ((first_character & 0xF8) == 0xF0) { // start of 4-byte sequence
        if (buf_index+3>=buffer.length) return 0;
        if (((buffer[buf_index + 1] & 0xC0) == 0x80)
         && ((buffer[buf_index + 2] & 0xC0) == 0x80)
         && ((buffer[buf_index + 3] & 0xC0) == 0x80))
            return 4;
    }
    else if ((first_character & 0xF0) == 0xE0) { // start of 3-byte sequence
        if (buf_index+2>=buffer.length) return 0;
        if (((buffer[buf_index + 1] & 0xC0) == 0x80)
         && ((buffer[buf_index + 2] & 0xC0) == 0x80))
            return 3;
    }
    else if ((first_character & 0xE0) == 0xC0) { // start of 2-byte sequence
        if (buf_index+1>=buffer.length) return 0;
        if ((buffer[buf_index + 1] & 0xC0) == 0x80)
            return 2;
    }
    return 0;
}

public static boolean isUTF8 ( File file ) {
    int file_size;
    if (null == file) {
        throw new IllegalArgumentException ("input file can't be null");
    }
    if (file.isDirectory ()) {
        throw new IllegalArgumentException ("input file refers to a directory");
    }

    file_size = (int) file.length(); // File exposes length(), not size()
    // read the entire input file into memory
    byte [] buffer = new byte[file_size];
    try {
        FileInputStream fis = new FileInputStream ( file );
        // a single read() may return fewer bytes than requested, so loop
        int off = 0, n;
        while (off < file_size && (n = fis.read(buffer, off, file_size - off)) != -1) {
            off += n;
        }
        fis.close ();
    }
    catch ( IOException e ) {
        throw new IllegalArgumentException ("Can't read input file, error = " + e.getLocalizedMessage () );
    }

    int buf_index = 0;
    while (buf_index < file_size) {
        int step = UTF8size(buffer, buf_index);
        if (step == 0) return false; // definitely not a UTF-8 file
        buf_index += step;
    }

    return true; // appears to be a UTF-8 file
}
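Hypothetical usage of the sketch above (the file path is purely illustrative):

// Example call (path is illustrative)
File f = new File("/tmp/sample.txt");
System.out.println(f + ": " + (isUTF8(f) ? "appears to be UTF-8" : "not valid UTF-8"));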
answered Oct 11 '22 by Ira Baxter