
Determining whether a MIME type is binary or text-based

Is there a library that can determine whether a given content type is binary or text-based?

Obviously text/* is always textual, but for things like application/json, image/svg+xml or even application/x-latex it's rather tricky without inspecting the actual data.

AnC asked Oct 07 '10 06:10




2 Answers

I don't know of a definitive list of binary and non-binary MIME types, but for the common MIME types I think the following does pretty well.

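def is_binary(mime_type, subtype):
    """Guess whether a MIME type/subtype pair, e.g. ("application", "json"), is binary."""
    # Everything under text/ is textual.
    if mime_type == "text":
        return False
    # Other top-level types (image, audio, video, ...) are treated as binary.
    if mime_type != "application":
        return True
    # application/* is binary unless the subtype is a known text-based format.
    return subtype not in ["json", "ld+json", "x-httpd-php", "x-sh", "x-csh", "xhtml+xml", "xml"]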
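
The function above classifies image/svg+xml, which the question mentions, as binary. A possible refinement, sketched here under the assumption that a structured syntax suffix such as +xml or +json marks a text-based format, is to check the suffix before falling back to the list. The name is_binary_mime, the constants, and the extra subtypes (javascript, x-latex) are hypothetical additions for illustration, not a definitive list.

```python
# Hypothetical helper; the subtype list and the suffix rule are assumptions,
# not part of any standard classification.
TEXT_APPLICATION_SUBTYPES = {
    "json", "ld+json", "x-httpd-php", "x-sh", "x-csh", "xhtml+xml", "xml",
    "javascript", "x-latex",  # assumed to be text-based
}
TEXT_SUFFIXES = ("+xml", "+json")


def is_binary_mime(content_type):
    """Guess whether a content type such as 'image/svg+xml; charset=utf-8' is binary."""
    # Drop parameters (e.g. charset) and normalise case.
    essence = content_type.split(";", 1)[0].strip().lower()
    mime_type, _, subtype = essence.partition("/")
    if mime_type == "text":
        return False
    # Assumption: a structured syntax suffix signals a text-based format.
    if subtype.endswith(TEXT_SUFFIXES):
        return False
    if mime_type == "application" and subtype in TEXT_APPLICATION_SUBTYPES:
        return False
    return True
```

With this sketch, is_binary_mime("image/svg+xml") is False while is_binary_mime("image/png") is True. Types it does not recognise still default to binary, so the list needs extending for any other text formats you expect.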
W.P. McNeill answered Sep 30 '22 17:09


There's a Python wrapper for libmagic, pymagic. That's the easiest way to accomplish what you want. Keep in mind that magic is only as good as its fingerprints: you can get false positives if something 'looks' like another file format, but in most cases pymagic will give you what you need.

One thing to watch out for is the 'simple solution' of checking whether any of the characters fall outside the printable ASCII range: you will likely encounter Unicode text, whose encoded bytes look like binary (and in fact are binary) even though the content is purely textual.
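
One way to avoid that pitfall when inspecting the data itself is to try decoding the bytes instead of testing for ASCII. This is a minimal sketch, assuming the content is expected to be UTF-8 and that a NUL byte indicates binary data; the function name looks_like_text is hypothetical.

```python
def looks_like_text(data, encoding="utf-8"):
    """Return True if the bytes decode cleanly and contain no NUL bytes."""
    # Assumption: text in the expected encoding never contains NUL bytes.
    if b"\x00" in data:
        return False
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True
```

Non-ASCII text such as "héllo wörld".encode("utf-8") passes this check, whereas a PNG header does not. It is still only a heuristic: UTF-16 text contains NUL bytes and would be rejected, and a sample cut off in the middle of a multi-byte character fails to decode.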

synthesizerpatel answered Sep 30 '22 16:09