
How to identify binary and text files using Python? [duplicate]

I need to identify which files in a directory are binary and which are text.

I tried using mimetypes, but it isn't a good fit in my case because it can't identify the MIME type of every file, and I have some strange ones here. I just need to know: binary or text. Simple? But I couldn't find a solution.

Thanks

Thomas asked Sep 18 '09 19:09


People also ask

How do I know if a file is binary or text in Python?

You can check whether: 1) the file contains \n; 2) the number of bytes between \n's is relatively small (this is NOT reliable); 3) the file doesn't contain bytes with a value less than that of the ASCII space character (' '), EXCEPT "\n", "\r", "\t" and zeroes.
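Check 3 can be sketched in a few lines. This is only an illustration: the names `ALLOWED_CONTROL` and `looks_like_text` are made up here, and the default allowed set leaves out zero bytes (pass them in explicitly if you want to follow the rule exactly as quoted above).

```python
# Control bytes that commonly appear in text files (hypothetical default set).
ALLOWED_CONTROL = frozenset(b"\n\r\t")


def looks_like_text(data, allowed=ALLOWED_CONTROL):
    """Return True if no byte is below 0x20 (space) unless it is in `allowed`."""
    return all(b >= 0x20 or b in allowed for b in data)


if __name__ == "__main__":
    print(looks_like_text(b"hello\nworld\r\n"))
    print(looks_like_text(b"\x89PNG\r\n\x1a\n"))
    # Tolerate zero bytes as well, as the quoted rule suggests.
    print(looks_like_text(b"a\x00b", allowed=ALLOWED_CONTROL | {0}))
```

Note that bytes above 0x7E pass this check, so it says nothing about which encoding the text uses.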

How do you identify a text file in Python?

Look for the Unicode byte-order mark at the start of the file. If the file is regularly 00 xx 00 xx 00 xx (for arbitrary xx) or vice versa, that's quite possibly UTF-16. Otherwise, look for 0s in the file; a file with a 0 byte in it is unlikely to be a single-byte-encoding text file.
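A rough sketch of those three steps, assuming you already have the first bytes of the file in memory. The function name `guess_encoding` and the return labels (`"8-bit"`, `None` for "probably not text") are invented for this example; the BOM constants come from the standard `codecs` module.

```python
import codecs

# UTF-32 LE must be tested before UTF-16 LE: its BOM starts with the same two bytes.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def guess_encoding(data):
    """Guess from raw bytes; returns a label, or None if it is probably not text."""
    # Step 1: byte-order mark at the start.
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    # Step 2: a strict "xx 00 xx 00" or "00 xx 00 xx" pattern hints at UTF-16.
    even, odd = data[0::2], data[1::2]
    if len(data) >= 2:
        if not any(odd) and all(even):
            return "utf-16-le"
        if not any(even) and all(odd):
            return "utf-16-be"
    # Step 3: any other zero byte makes single-byte text unlikely.
    if b"\0" in data:
        return None
    return "8-bit"
```

The UTF-16 pattern test is deliberately strict (every other byte must be zero), so it only recognises text made of Latin-1-range characters; real UTF-16 with other scripts would need a looser test.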


1 Answer

Thanks, everybody. I found a solution that suits my problem. I found this code at http://code.activestate.com/recipes/173220/ and changed just a small piece to suit me.

It works fine.

# Python 3 port: the code as posted relied on Python 2's string.maketrans
# and on map() returning a list, so it does not run on Python 3.

def istext(filename):
    # Read in binary mode so the raw bytes are inspected, with no
    # newline or encoding translation applied.
    with open(filename, "rb") as f:
        s = f.read(512)
    # Printable ASCII plus newline, carriage return, tab and backspace.
    text_characters = bytes(range(32, 127)) + b"\n\r\t\b"
    if not s:
        # Empty files are considered text
        return True
    if b"\0" in s:
        # Files with null bytes are likely binary
        return False
    # Delete the text characters; whatever is left is non-text.
    t = s.translate(None, text_characters)
    # If more than 30% non-text characters, then
    # this is considered a binary file
    if len(t) / len(s) > 0.30:
        return False
    return True
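
The function above looks at one file at a time. Since the question was about a whole directory, here is a minimal sketch of how the same heuristic might be applied to every file in one. The helper names `is_text_file` and `classify_directory` and the `blocksize`/`threshold` parameters are made up for this example, and it targets Python 3; the 512-byte sample and 30% cutoff are the ones used in the answer.

```python
import os

# Printable ASCII plus newline, carriage return, tab and backspace.
TEXT_BYTES = bytes(range(32, 127)) + b"\n\r\t\b"


def is_text_file(path, blocksize=512, threshold=0.30):
    """Heuristic: sample the first block and measure the share of non-text bytes."""
    with open(path, "rb") as f:
        chunk = f.read(blocksize)
    if not chunk:
        return True  # empty files count as text
    if b"\0" in chunk:
        return False  # null bytes suggest binary
    nontext = chunk.translate(None, TEXT_BYTES)
    return len(nontext) / len(chunk) <= threshold


def classify_directory(dirpath):
    """Split the regular files directly inside dirpath into text and binary names."""
    result = {"text": [], "binary": []}
    for entry in sorted(os.listdir(dirpath)):
        path = os.path.join(dirpath, entry)
        if os.path.isfile(path):  # skip subdirectories and other non-files
            result["text" if is_text_file(path) else "binary"].append(entry)
    return result


if __name__ == "__main__":
    print(classify_directory("."))
```

It does not recurse into subdirectories; `os.walk` would be the usual way to do that if you need it.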

Thomas answered Sep 30 '22 21:09