
How to tell binary from text files in Linux

The Linux file command does a very good job of recognising file types and gives very fine-grained results. The diff tool can also tell binary files from text files, and produces different output for each.

Is there a way to tell binary files from text files? All I want is a yes/no answer to whether a given file is binary. Because it's difficult to define binary, let's say I want to know whether diff will attempt a text-based comparison.

To clarify the question: I do not care if it's ASCII text or XML as long as it's text. Also, I do not want to differentiate between MP3 and JPEG files, as they're all binary.

gabor asked Apr 15 '10 11:04


2 Answers

This approach defers to the grep command in determining whether a file is binary or text:

is_text_file() { grep -qIF '' "$1"; }

grep options used:

  • -q Quiet; Exit immediately with zero status if any match is found
  • -I Process a binary file as if it did not contain matching data
  • -F Interpret PATTERNS as fixed strings, not regular expressions.

grep pattern used:

  • '' Empty string. All files (except an empty file) will match this pattern.

Notes

  • An empty file is not considered a text file according to this test. (The GNU file command agrees with this assessment.)
  • A file with one printable character, say a, is considered a text file according to this test. (Makes sense to me, although the file command disagrees with this assessment. Tested with GNU file.)
  • This approach requires only one child process to test whether a file is text or binary.
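If you would rather have an empty file count as text (it contains no NUL bytes, after all), the test can be wrapped. This is only a sketch: the name is_text_or_empty_file is made up for illustration and is not part of the approach above, and it assumes a grep that supports -I, as GNU grep does.

```shell
# Same grep-based test as above.
is_text_file() { grep -qIF '' "$1"; }

# Hypothetical wrapper (an assumption, not from the original answer):
# also answer "yes" for an existing, zero-length regular file.
is_text_or_empty_file() {
  [ -f "$1" ] || return 1        # not a regular file: answer "no"
  [ ! -s "$1" ] && return 0      # empty file: answer "yes"
  is_text_file "$1"              # otherwise defer to grep
}
```

It is used the same way as is_text_file, for example: if is_text_or_empty_file "$f"; then echo yes; else echo no; fi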

Test

# cd into a temp directory
cd "$(mktemp -d)"

# Create 3 corner-case test files
touch empty_file       # An empty file
echo -n a >one_byte_a  # A file containing just `a`
echo a >one_line_a     # A file containing just `a` and a newline

# Another test case: a 96KiB text file that ends with a NUL
head -c 98303 /usr/share/dict/words > file_with_a_null_96KiB
dd if=/dev/zero bs=1 count=1 >> file_with_a_null_96KiB

# Last test case: a 96KiB text file plus a NUL added at the end
head -c 98304 /usr/share/dict/words > file_with_a_null_96KiB_plus1
dd if=/dev/zero bs=1 count=1 >> file_with_a_null_96KiB_plus1

# Defer to grep to determine if a file is a text file
# (same definition as described above: -F with an empty pattern)
is_text_file() { grep -qIF '' "$1"; }

# Test harness
do_test() {
  printf '%22s ... ' "$1"
  if is_text_file "$1"; then
    echo "is a text file"
  else
    echo "is a binary file"
  fi
}

# Test each of our test cases
do_test empty_file
do_test one_byte_a
do_test one_line_a
do_test file_with_a_null_96KiB
do_test file_with_a_null_96KiB_plus1

Output

            empty_file ... is a binary file
            one_byte_a ... is a text file
            one_line_a ... is a text file
file_with_a_null_96KiB ... is a binary file
file_with_a_null_96KiB_plus1 ... is a text file

On my machine, it seems grep checks the first 96 KiB of a file for a NUL. (Tested with GNU grep.) The exact crossover point depends on your machine's page size.

Relevant source code: https://git.savannah.gnu.org/cgit/grep.git/tree/src/grep.c?h=v3.6#n1550

Robin A. Meade answered Nov 06 '22 21:11


The diff manual specifies that

diff determines whether a file is text or binary by checking the first few bytes in the file; the exact number of bytes is system dependent, but it is typically several thousand. If every byte in that part of the file is non-null, diff considers the file to be text; otherwise it considers the file to be binary.
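The rule quoted above can be approximated in shell. What follows is a rough sketch, not diff's actual implementation: the function name looks_binary_to_diff and the 4096-byte default window are assumptions (the manual only says the number is system dependent), and it relies on head -c, which is widely available but not specified by POSIX.

```shell
# Hypothetical helper: succeed (exit 0) if a NUL byte occurs within the
# first N bytes of the file. N defaults to 4096, an assumed stand-in for
# diff's system-dependent window.
# The body runs in a subshell, ( ... ), so its variables do not leak.
looks_binary_to_diff() (
  n=${2:-4096}
  # Size of the prefix actually read.
  total=$(head -c "$n" "$1" | wc -c)
  # Size of the same prefix with NUL bytes stripped out.
  without_nul=$(head -c "$n" "$1" | tr -d '\000' | wc -c)
  # Any difference means at least one NUL byte was present.
  [ "$total" -ne "$without_nul" ]
)
```

For example: if looks_binary_to_diff some_file; then echo binary; else echo text; fi. An empty file comes out as text here, since it has no NUL bytes in its (empty) prefix.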

David Schmitt answered Nov 06 '22 23:11