Detecting the MIME type of a file with PHP is trivial: just use PEAR's MIME_Type package, PHP's fileinfo, or call file -i on a Unix machine.
This works really well for binary files and all others that have some kind of "magic bytes" through which they can be detected easily.
What I'm failing at is detecting the correct MIME type of plain text files:
All of them are identified as "text/plain", which is correct, but too unspecific for me. I need the real type, even if it costs some time to analyze the file content.
So my question: Which solutions exist to detect the MIME type of such plain text files? Any Libraries? Code snippets?
Note that I neither have a filename nor a file extension, but I have the file content.
If I used Ruby, I could integrate GitHub's linguist. Ohloh's ohcount is written in C, but has a command line tool to detect the type: ohcount -d $file
Detects XML and PHP files correctly, but none of the others.
Detects XML and HTML; all other test files were only seen as text/plain.
Since I didn't find a proper library, I wrote my own magic file that detects all of my test files properly.
My application first tries my custom magic file for detection and falls back to the normal/system magic file if no type is detected.
The code is on GitHub, see https://github.com/cweiske/MIME_Type_PlainDetect .
The magic file is at data/programming.magic and can be used with file -m programming.magic /path/to/source (-m selects the magic file; -f would instead read a list of file names to examine).
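The "custom rules first, system detection second" order described above can be sketched in a few lines of shell. This is only an illustration, not the code from the repository: the function names detect_custom and detect_type, the handful of first-line patterns, and the MIME names they return are all assumptions, and the custom stage uses plain shell pattern matching instead of a real magic file so that it runs without extra files.

```shell
#!/bin/sh
# Sketch of two-stage detection: custom content rules first, system `file` second.
# The rules and MIME names are illustrative assumptions, not the contents of
# data/programming.magic.

# Stage 1: hypothetical custom rules that look only at the first line.
# Prints a MIME type and returns 0 on a match, returns 1 otherwise.
detect_custom() {
    first_line=$(head -n 1 "$1")
    case "$first_line" in
        '<?php'*)                     echo 'text/x-php' ;;
        '<?xml'*)                     echo 'application/xml' ;;
        '#!/bin/sh'*|'#!/bin/bash'*)  echo 'text/x-shellscript' ;;
        *)                            return 1 ;;
    esac
}

# Try the custom rules; only if they find nothing, ask the system magic database.
detect_type() {
    if detect_custom "$1"; then
        return 0
    fi
    # Stage 2: fall back to `file` if it is installed.
    if command -v file >/dev/null 2>&1; then
        file -b --mime-type "$1"
    else
        echo 'application/octet-stream'
    fi
}
```

Calling detect_type /path/to/source prints one MIME type per file. Putting the specific rules first matters because the generic detection would otherwise answer text/plain before the specific rules ever get a chance.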
I think the magic-based detection in Apache Tika could help you:
http://tika.apache.org/