Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to check whether the file is binary?

I wrote the following method to see whether particular file contains ASCII text characters only or control characters in addition to that. Could you glance at this code, suggest improvements and point out oversights?

The logic is as follows: "If first 500 bytes of a file contain 5 or more Control characters - report it as binary file"

thank you.

public boolean isAsciiText(String fileName) throws IOException {

    InputStream in = new FileInputStream(fileName);
    byte[] bytes = new byte[500];

    in.read(bytes, 0, bytes.length);
    int x = 0;
    short bin = 0;

    for (byte thisByte : bytes) {
        char it = (char) thisByte;
        if (!Character.isWhitespace(it) && Character.isISOControl(it)) {

            bin++;
        }
        if (bin >= 5) {
            return false;
        }
        x++;
    }
    in.close();
    return true;
}
like image 854
James Raitsev Avatar asked Jun 22 '10 13:06

James Raitsev


People also ask

How can you tell if a file is binary?

We can usually tell if a file is binary or text based on its file extension. This is because by convention the extension reflects the file format, and it is ultimately the file format that dictates whether the file data is binary or text.

How do you tell if a file is binary or Ascii in Windows?

You can either check for values with ASCII code <128 or for some charset you define (e.g. 'a'-'z','A'-'Z','0'-'9'...) and treat the file as binary if it contains some other characters. You could also check for regular linebreaks (0x10 or 0x13,0x10) to detect text files.

Is a file a binary?

A binary file is a file whose content is in a binary format consisting of a series of sequential bytes, each of which is eight bits in length. The content must be interpreted by a program or a hardware processor that understands in advance exactly how that content is formatted and how to read the data.


1 Answers

Since you call this class "isASCIIText", you know exactly what you're looking for. In other words, it's not "isTextInCurrentLocaleEncoding". Thus you can be more accurate with:

if (thisByte < 32 || thisByte > 127) bin++;

edit, a long time later — it's pointed out in a comment that this simple check would be tripped up by a text file that started with a lot of newlines. It'd probably be better to use a table of "ok" bytes, and include printable characters (including carriage return, newline, and tab, and possibly form feed though I don't think many modern documents use those), and then check the table.

like image 176
Pointy Avatar answered Oct 17 '22 16:10

Pointy