I need to build a list of all the file extensions of binary files located within a directory tree.
The main question is how to distinguish a text file from a binary one; the rest should be a piece of cake.
EDIT: This is the closest I got. Any better ideas?
find . -type f|xargs file|grep -v text|sed -r 's:.*\.(.*)\:.*:\1:g'
Here's a trick to find the binary files:
grep -r -m 1 "^" <Your Root> | grep "^Binary file"
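The -m 1 option makes grep stop reading each file after the first match instead of scanning the whole file. Note that GNU grep 3.5 and later report this as "grep: FILE: binary file matches" on standard error, so the second grep needs adjusting on those versions.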
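A variation on the same idea that does not depend on the wording of grep's message is to let grep itself list the files it considers binary. This is only a sketch, assuming GNU grep: list_binary_files is a made-up helper name, and the -I/-L combination is swapped in for the message matching used above rather than taken from it.

```shell
# list_binary_files DIR  (hypothetical helper name; assumes GNU grep)
# -I treats binary files as non-matching, -L lists files without a match,
# and "^" matches every line, so only binary and empty files get listed.
# LC_ALL=C keeps the result independent of the caller's locale.
list_binary_files() {
    LC_ALL=C grep -rIL '^' "$1" | while IFS= read -r f; do
        # Skip empty files: they have no line for "^" to match.
        if [ -s "$f" ]; then
            printf '%s\n' "$f"
        fi
    done
}
```

Calling list_binary_files . prints one path per line. File names that contain newlines would confuse the read loop.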
This Perl one-liner worked for me, and it was also quite fast:
find . -type f -exec perl -MFile::Basename -e 'print (-T $_ ? "" : (fileparse ($_, qr/\.[^.]*/))[2] . "\n" ) for @ARGV' {} + | sort | uniq
And this is how you can find all binary files under the current folder:
find . -type f -exec perl -e 'print (-B $_ ? "$_\n" : "" ) for @ARGV' {} +
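-T is a test for text files and -B is a test for binary files. They are opposites of each other*, except that an empty file passes both tests.
*See the Perl file test operators documentation (perldoc -f -X).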
There is no difference between a binary file and a text file on Linux. The file utility looks at the contents and guesses. Unfortunately, it's not of much help because file doesn't produce a simple "binary or text" answer; it has a complex output with a large number of cases that you would have to parse.
One approach is to read a fixed-size prefix of a file, say 256 bytes, and then apply some heuristics. For instance, are all the byte values in the range 0x00 to 0x7F, with no control codes other than common whitespace? That suggests ASCII. If there are bytes in the range 0x80 through 0xFF, does the entire buffer (except for one character at the end, which may be chopped) decode as valid UTF-8? And so on.
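As a rough illustration of the first of those heuristics, the sketch below checks whether the first 256 bytes are plain ASCII text. The helper name looks_like_ascii is made up for this example, head -c is assumed to be available, and the UTF-8 branch is left out entirely.

```shell
# looks_like_ascii FILE  (hypothetical helper name)
# Succeeds when the first 256 bytes contain only printable ASCII
# (0x20-0x7E) plus tab, line feed and carriage return.
looks_like_ascii() {
    # Delete every acceptable byte and count whatever is left over.
    bad=$(( $(head -c 256 "$1" | LC_ALL=C tr -d '\011\012\015\040-\176' | wc -c) ))
    [ "$bad" -eq 0 ]
}
```

For example, looks_like_ascii notes.txt && echo "probably text". An empty file passes the check, since it contains no suspicious bytes at all.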
One idea might be to sneakily exploit utilities which detect binary files, like GNU diff.
$ diff -r /bin/ls <(echo foo)
Binary files /bin/ls and /dev/fd/63 differ
It still works without process substitution:
$ diff -r /bin/ls /dev/null
Binary files /bin/ls and /dev/null differ
Now just grep the output of that and look for the word Binary.
The question is whether diff's heuristic for binary files works for your purposes.
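Put together, the diff trick could be wrapped up roughly as follows. This is a sketch under assumptions: both function names are made up, the "Binary files ... differ" wording is the one shown above and is pinned with LC_ALL=C, and paths containing newlines are not handled.

```shell
# is_binary_per_diff FILE  (hypothetical helper name)
# Succeeds when diff reports FILE as binary when compared with /dev/null.
is_binary_per_diff() {
    LC_ALL=C diff "$1" /dev/null 2>/dev/null | grep -q '^Binary files'
}

# binary_extensions DIR  (hypothetical helper name)
# Prints the distinct extensions of the files under DIR that diff calls binary.
binary_extensions() {
    find "$1" -type f | while IFS= read -r f; do
        if is_binary_per_diff "$f"; then
            name=${f##*/}                 # strip the directory part
            case $name in
                *.*) printf '%s\n' "${name##*.}" ;;   # keep text after the last dot
            esac
        fi
    done | sort -u
}
```

Running binary_extensions . from the top of the tree prints one extension per line. Binary files without a dot in their name are skipped, and an empty file is never reported because it is identical to /dev/null.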
There is no sure way to differentiate a "text" file from a "binary" file; it is guesswork. The following script guesses from the share of printable bytes in the first 4096 bytes:
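#!/bin/bash
# Guess whether "$1" is text by comparing the bytes that strings extracts
# from the first 4096 bytes with the total number of bytes read.
file=$1
total=$(head -c 4096 -- "$file" | wc -c)
printable=$(head -c 4096 -- "$file" | strings -a -n 1 | wc -c)
# Text if printable * 1.05 / total >= 1, i.e. roughly 95% printable.
# Integer arithmetic replaces bc; an empty file is reported as binary.
if (( total > 0 && printable * 105 >= total * 100 )); then
    echo "$file is a text file"
    exit 0
else
    echo "$file is a binary file"
    exit 1
fi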