
How to accurately determine MIME data from a file?

I'm adding some functionality to a program so that I can accurately determine a file's type by reading its MIME data. I've already tried a few methods:

Method 1:

javax.activation.FileDataSource

FileDataSource ds = new FileDataSource("~\\Downloads\\777135_new.xls");  
String contentType = ds.getContentType();  
System.out.println("The MIME type of the file is: " + contentType);

//output = The MIME type of the file is: application/octet-stream

Method 2:

import net.sf.jmimemagic.*;

try
{
    RandomAccessFile f = new RandomAccessFile("~\\Downloads\\777135_new.xls", "r");
    byte[] fileBytes = new byte[(int)f.length()];
    f.read(fileBytes);
    MagicMatch match = Magic.getMagicMatch(fileBytes);
    System.out.println("The Mime type is: " + match.getMimeType());
}
catch(Exception e)
{
    System.out.println(e);
}

//output = The Mime type is: application/msword

Method 3:

import eu.medsea.mimeutil.*;

MimeUtil.registerMimeDetector("eu.medsea.mimeutil.detector.MagicMimeMimeDetector");
File f = new File ("~\\Downloads\\777135_new.xls");
Collection<?> mimeTypes = MimeUtil.getMimeTypes(f);
String mimeType = MimeUtil.getFirstMimeType(mimeTypes.toString()).toString();
String subMimeType = MimeUtil.getSubType(mimeTypes.toString());
System.out.println("The Mime type is: " + mimeTypes + ", " + mimeType + ", " + subMimeType);

//output = The Mime type is: application/msword, application/msword, msword

I found these three methods at http://www.rgagnon.com/javadetails/java-0487.html. My problem is that the file I am testing is one I created myself, so I know it's an Excel file, yet the second and third methods both report it as msword. The first method returns application/octet-stream instead, which I believe is because of the limited number of file types in the built-in FileTypeMap that it uses.

I've had a look around, and some people say the cause is the way the offset is detected in the files, which leads to the content type being picked up incorrectly, as pointed out in this wiki on detecting file types in PHP. Unfortunately, the wiki then goes on to use the extension to determine the file type, which isn't what I want to do because it's unreliable.

Can anyone point me in the right direction to a method that will detect the file types correctly within Java please?

Cheers, Alexei Blue.

Edit: It looks like there is no specific solution to this, as @IronMensan said in the comment below. I did find this really interesting research paper that applies machine learning in a few ways to help with the issue, but there doesn't seem to be a foolproof answer. I think my best bet here will be to pass the file to an Excel file reader and catch any incorrect-format exceptions.
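
The parse-and-catch idea from the edit above could look roughly like the sketch below. It is only an illustration: `ParseProbe` and `FormatReader` are hypothetical names, not part of any library, and a real reader (for example, the workbook loader of whichever Excel library is in use) would have to be plugged in behind the interface.

```java
import java.util.Map;
import java.util.Optional;

/** Hypothetical helper: identifies a file by trying format readers in turn. */
final class ParseProbe {

    /** Assumed reader contract: throws if the bytes are not in its format. */
    interface FormatReader {
        void read(byte[] data) throws Exception;
    }

    /**
     * Returns the name of the first reader that parses the data without
     * throwing, or an empty Optional if every reader rejects it.
     * Pass an ordered map (such as LinkedHashMap) to control the try order.
     */
    static Optional<String> firstAccepting(byte[] data, Map<String, FormatReader> readers) {
        for (Map.Entry<String, FormatReader> entry : readers.entrySet()) {
            try {
                entry.getValue().read(data);
                return Optional.of(entry.getKey());
            } catch (Exception rejected) {
                // This reader could not parse the data; try the next one.
            }
        }
        return Optional.empty();
    }
}
```

Catching `Exception` this broadly is a deliberate simplification, since different readers signal a wrong format with different exception types. The approach only confirms formats a reader exists for, and it costs a full parse per attempt.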

Alexei Blue asked Dec 13 '11 at 11:12



2 Answers

So far, the most accurate tool I've found to determine a file's MIME type is Apache Tika. This is a slight modification of what I currently use (with Tika version 1.0):

import java.io.File;
import java.io.IOException;

import org.apache.tika.detect.DefaultDetector;
import org.apache.tika.detect.Detector;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MimeTypes;

private static final Detector DETECTOR = new DefaultDetector(
        MimeTypes.getDefaultMimeTypes());

public static String detectMimeType(final File file) throws IOException {
    TikaInputStream tikaIS = null;
    try {
        tikaIS = TikaInputStream.get(file);

        /*
         * You might not want to provide the file's name. If you provide an Excel
         * document with a .xls extension, it will get it correct right away; but
         * if you provide an Excel document with .doc extension, it will guess it
         * to be a Word document
         */
        final Metadata metadata = new Metadata();
        // metadata.set(Metadata.RESOURCE_NAME_KEY, file.getName());

        return DETECTOR.detect(tikaIS, metadata).toString();
    } finally {
        if (tikaIS != null) {
            tikaIS.close();
        }
    }
}

Because Tika uses magic numbers but also looks at the contents of files when unsure, the process can be a little slow (it took 3.268 seconds for my PC to examine 15 files).

Also, don't make the same mistake I did at first. If you get the tika-core JAR, you should also get the tika-parsers JAR. Without tika-parsers you won't get any exceptions; you will simply not get the MIME type accurately, so it is REALLY important to include it.

An alternative is to get the tika-app JAR, which contains tika-core, tika-parsers and all of the dependencies (there are a lot: poi, poi-ooxml, xmlbeans and commons-compress, just to name a few).

rodrigo.garcia answered Oct 16 '22 at 07:10



As mentioned in the comments, there are so many possible file types that detection could be hit and miss across ALL possible files, but you probably know the types of files you are typically going to be dealing with. This excellent list of magic numbers has helped me do detection recently around the specific Office formats you mentioned (search for Microsoft Office). You'll see that the MS Office file types have a sub-type specified further into the file, which lets you work out specifically which type of file you have. Many newer formats, such as ODT and the OOXML formats like DOCX, use a ZIP file to hold their data, so you might need to detect ZIP first and then look for specifics.
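
A minimal sketch of that two-step check, using only the standard library (Java 9 or later for `readAllBytes`), might look like this. `OfficeTypeSniffer` and its `Kind` enum are hypothetical names invented for the example. The `word/`, `xl/` and `ppt/` prefixes are the usual OOXML package layout rather than a guarantee, and the legacy OLE2 sub-type (Word vs. Excel) is deliberately left unresolved because it requires reading the container's directory further into the file.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

/** Hypothetical helper, not part of any library: sniffs Office containers by content. */
final class OfficeTypeSniffer {

    enum Kind { OLE2, DOCX, XLSX, PPTX, ODF, ZIP, UNKNOWN }

    // Signature shared by every legacy OLE2 file (.doc, .xls and .ppt alike).
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // Local file header signature at the start of an ordinary ZIP archive.
    private static final byte[] ZIP_MAGIC = { 'P', 'K', 3, 4 };

    static Kind sniff(Path file) throws IOException {
        return sniff(Files.readAllBytes(file));
    }

    static Kind sniff(byte[] data) throws IOException {
        if (startsWith(data, OLE2_MAGIC)) {
            // Word, Excel and PowerPoint cannot be told apart from this header;
            // the sub-type is recorded further into the file.
            return Kind.OLE2;
        }
        return startsWith(data, ZIP_MAGIC) ? sniffZip(data) : Kind.UNKNOWN;
    }

    // Step two for ZIP containers: look for format-specific entries.
    private static Kind sniffZip(byte[] data) throws IOException {
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(data))) {
            for (ZipEntry e = zip.getNextEntry(); e != null; e = zip.getNextEntry()) {
                String name = e.getName();
                if (name.equals("mimetype")) {
                    // ODF packages declare their own type in this entry.
                    String declared = new String(zip.readAllBytes(), StandardCharsets.US_ASCII);
                    if (declared.startsWith("application/vnd.oasis.opendocument")) {
                        return Kind.ODF;
                    }
                }
                if (name.startsWith("word/")) return Kind.DOCX;
                if (name.startsWith("xl/")) return Kind.XLSX;
                if (name.startsWith("ppt/")) return Kind.PPTX;
            }
        }
        return Kind.ZIP; // a ZIP archive with no recognised Office parts
    }

    private static boolean startsWith(byte[] data, byte[] magic) {
        return data.length >= magic.length
                && Arrays.equals(Arrays.copyOf(data, magic.length), magic);
    }
}
```

Calling `OfficeTypeSniffer.sniff(Path.of("report.xls"))` (the file name is just an example) ignores the extension entirely, so a renamed file is still classified by its content. Reading the whole file into memory keeps the sketch short; a production version would read only the header first.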

Paul Jowett answered Oct 16 '22 at 06:10
