I need identify which file is binary and which is a text in a directory.
I tried use mimetypes but it isnt a good idea in my case because it cant identify all files mimes, and I have strangers ones here... I just need know, binary or text. Simple ? But I couldn´t find a solution...
Thanks
You can check if: 1) file contains \n 2) Amount of bytes between \n's is relatively small (this is NOT reliable)l 3) file doesn't bytes with value less than value of ASCCI "space" character (' ') - EXCEPT "\n" "\r" "\t" and zeroes.
Look for the Unicode byte-order-mark at the start of the file. If the file is regularly 00 xx 00 xx 00 xx (for arbitrary xx) or vice versa, that's quite possibly UTF-16. Otherwise, look for 0s in the file; a file with a 0 in is unlikely to be a single-byte-encoding text file.
Thanks everybody, I found a solution that suited my problem. I found this code at http://code.activestate.com/recipes/173220/ and I changed just a little piece to suit me.
It works fine.
from __future__ import division
import string
def istext(filename):
s=open(filename).read(512)
text_characters = "".join(map(chr, range(32, 127)) + list("\n\r\t\b"))
_null_trans = string.maketrans("", "")
if not s:
# Empty files are considered text
return True
if "\0" in s:
# Files with null bytes are likely binary
return False
# Get the non-text characters (maps a character to itself then
# use the 'remove' option to get rid of the text characters.)
t = s.translate(_null_trans, text_characters)
# If more than 30% non-text characters, then
# this is considered a binary file
if float(len(t))/float(len(s)) > 0.30:
return False
return True
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With