
How to detect MIME type of plain text files: CSS, Javascript, ini, sql?

Detecting the MIME type of a file with PHP is trivial - just use PEAR's MIME_Type package, PHP's fileinfo or call file -i on a Unix machine. This works really well for binary files and all others that have some kind of "magic bytes" through which they can be detected easily.
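As a minimal sketch of the file -i route when only the content is available: the file utility can read from standard input when given - as the file name. The helper name mime_from_content is made up for this example, and the sketch assumes a file binary that supports --brief and --mime-type; it returns None if no such binary is on the PATH.

```python
import shutil
import subprocess
from typing import Optional


def mime_from_content(content: bytes) -> Optional[str]:
    """Ask the system `file` utility for the MIME type of raw content.

    Hypothetical helper: returns None when `file` is missing or fails.
    """
    if shutil.which("file") is None:
        return None
    # "-" makes `file` read the data from standard input,
    # so no file name or extension is needed.
    result = subprocess.run(
        ["file", "--brief", "--mime-type", "-"],
        input=content,
        capture_output=True,
        check=False,
    )
    if result.returncode != 0:
        return None
    return result.stdout.decode("utf-8", "replace").strip() or None


if __name__ == "__main__":
    print(mime_from_content(b"body { color: red; }\n"))
```

For plain text sources such as CSS or SQL this will typically still report a generic text type, which is exactly the limitation described here.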

What I'm failing at is detecting the correct MIME type of plain text files:

  • CSS
  • Diff
  • INI (configuration)
  • Javascript
  • rST
  • SQL

All of them are identified as "text/plain", which is correct but not specific enough for me. I need the real type, even if analyzing the file content costs some time.

So my question: Which solutions exist to detect the MIME type of such plain text files? Any libraries? Code snippets?


Note that I neither have a filename nor a file extension, but I have the file content.


If I used Ruby, I could integrate GitHub's linguist. Ohloh's ohcount is written in C, but has a command-line tool to detect the type: ohcount -d $file

What I've tried

ohcount

Detects XML and PHP files correctly, but none of the others.

Apache Tika

Detects XML and HTML; all other test files were only seen as text/plain.

asked May 08 '12 19:05 by cweiske




2 Answers

Since I didn't find a proper library, I wrote my own magic file that detects all of my test files properly.

My application first tries my custom magic file for detection and falls back to the normal/system magic file if no type is detected.

The code is on GitHub; see https://github.com/cweiske/MIME_Type_PlainDetect . The magic file is at data/programming.magic and can be used with file -m programming.magic /path/to/source (-m selects the magic file; -f would instead read a list of file names to examine).
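The "custom rules first, system detection as fallback" order described above can be sketched roughly as follows. This is not the code from the repository: it swaps the magic file for a handful of regular expressions, purely to illustrate the lookup order. The patterns, the names CUSTOM_RULES and detect_mime, and the unregistered types text/x-diff, text/x-ini and text/x-rst are all assumptions made for this example, and the heuristics are easy to fool (a Markdown heading underlined with === would be reported as rST, for instance).

```python
import re
from typing import Callable

# Hypothetical content rules, checked in order; the first match wins.
CUSTOM_RULES = [
    ("text/x-diff", re.compile(
        r"^(?:diff --git |--- \S.*\n\+\+\+ \S|Index: )", re.M)),
    ("application/sql", re.compile(
        r"^\s*(?:SELECT\b.+\bFROM\b|INSERT\s+INTO\b|CREATE\s+TABLE\b"
        r"|UPDATE\s+\S+\s+SET\b)", re.I | re.M)),
    # A [section] header followed by a key = value line.
    ("text/x-ini", re.compile(
        r"^\[[^\]\n]+\][ \t]*\n(?:[;#].*\n|[ \t]*\n)*[\w.-]+[ \t]*=", re.M)),
    ("text/javascript", re.compile(
        r"\bfunction\s*\w*\s*\(|\b(?:var|let|const)\s+\w+\s*=")),
    # selector { property: value;
    ("text/css", re.compile(
        r"^[^{}\n;]+\{\s*[a-z-]+\s*:[^;{}]+;", re.I | re.M)),
    # An underlined section title, or a directive such as ".. note::".
    ("text/x-rst", re.compile(
        r"^\S.*\n(?:={3,}|-{3,}|~{3,})[ \t]*$|^\.\. [\w-]+::", re.M)),
]


def detect_mime(
    content: str,
    fallback: Callable[[str], str] = lambda _content: "text/plain",
) -> str:
    """Try the custom rules first; defer to `fallback` if none matches."""
    for mime, pattern in CUSTOM_RULES:
        if pattern.search(content):
            return mime
    # Stand-in for the normal/system magic database.
    return fallback(content)


if __name__ == "__main__":
    print(detect_mime("body {\n  color: red;\n}\n"))  # text/css
    print(detect_mime("Just some prose.\n"))          # text/plain
```

The fallback is passed in as a callable so that the system detection (fileinfo, file -i, or anything else) can be plugged in without touching the custom rules.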

answered Oct 21 '22 00:10 by cweiske


I think the magic-based detection in Apache Tika could help you:

http://tika.apache.org/

answered Oct 20 '22 23:10 by Pier-Alexandre Bouchard