How to reliably detect file types? [duplicate]

Objective: given a file, determine whether it is of a given type (XML, JSON, Properties, etc.).

Consider the case of XML. Until we ran into this issue, the following approach worked fine:

    try {
        saxReader.read(f);
    } catch (DocumentException e) {
        logger.warn("  - File is not XML: " + e.getMessage());
        return false;
    }
    return true;

As expected, when the XML is well formed, the parse succeeds and the method returns true. If the file can't be parsed, the exception is caught and the method returns false.
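
To make that behaviour concrete, here is a minimal self-contained sketch of the same idea. It uses the JDK's built-in SAX parser rather than the saxReader above, and the names WellFormedCheck and isWellFormedXml are hypothetical, chosen only for illustration. It shows that a strict parse accepts only well-formed input and reports everything else as "not XML":

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class WellFormedCheck {

    /** Returns true only if the stream parses as well-formed XML. */
    static boolean isWellFormedXml(InputStream in) {
        try {
            // DefaultHandler ignores all content but rethrows fatal parse errors.
            SAXParserFactory.newInstance().newSAXParser().parse(in, new DefaultHandler());
            return true;
        } catch (SAXException | IOException | ParserConfigurationException e) {
            return false;
        }
    }

    private static InputStream stream(String text) {
        return new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        System.out.println(isWellFormedXml(stream("<root><a/></root>"))); // true
        System.out.println(isWellFormedXml(stream("<root><a></root>")));  // false: unclosed <a>
        System.out.println(isWellFormedXml(stream("key=value")));         // false: not markup at all
    }
}
```

Note that the second and third inputs are indistinguishable to this check, even though only one of them was ever meant to be XML.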

This breaks, however, when we deal with a file that is malformed XML (but still XML).

I'd rather not rely on the .xml extension (it fails all the time), on looking for the <?xml version="1.0" encoding="UTF-8"?> string inside the file, etc.

Is there another way this can be handled?

What would you have to see inside the file to "suspect it may be XML even though a DocumentException was caught"? This is needed for parsing purposes.
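
For illustration only, one possible "suspicion" heuristic is sketched below. It is not a known-good answer: it assumes a UTF-8 or other ASCII-compatible encoding, it would also match HTML and similar markup, and the names XmlSniffer and looksLikeXml are hypothetical. The idea is that an XML document has to start with markup, so the first non-whitespace character after an optional byte order mark should be '<':

```java
import java.nio.charset.StandardCharsets;

public class XmlSniffer {

    /**
     * Heuristic: true if the first non-whitespace byte, after an optional
     * UTF-8 byte order mark, is '<'. Assumes an ASCII-compatible encoding.
     */
    static boolean looksLikeXml(byte[] head) {
        int i = 0;
        // Skip a UTF-8 byte order mark (EF BB BF) if present.
        if (head.length >= 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF) {
            i = 3;
        }
        // Skip XML whitespace: space, tab, carriage return, line feed.
        while (i < head.length
                && (head[i] == ' ' || head[i] == '\t' || head[i] == '\r' || head[i] == '\n')) {
            i++;
        }
        return i < head.length && head[i] == '<';
    }

    public static void main(String[] args) {
        // Malformed XML is still flagged as "probably XML".
        System.out.println(looksLikeXml("<root><a></root>".getBytes(StandardCharsets.UTF_8))); // true
        // A properties-style line is not.
        System.out.println(looksLikeXml("key=value".getBytes(StandardCharsets.UTF_8)));        // false
    }
}
```

A check like this could run only after the strict parse has failed, to tell "broken XML" apart from "not XML at all".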

James Raitsev asked Mar 16 '12 13:03



3 Answers

File type detection tools:

  • Mime Type Detection Utility
  • DROID (Digital Record Object Identification)
  • ftc - File Type Classifier
  • JHOVE, JHOVE2
  • NLNZ Metadata Extraction Tool
  • Apache Tika
  • TrID, TrIDNet
  • Oracle Outside In (commercial)
  • Forensic Innovations File Investigator TOOLS (commercial)
Lior Kogan answered Oct 20 '22 20:10



Apache Tika gives me the fewest issues and, unlike Java 7's Files.probeContentType, is not platform-specific.

    import java.io.File;
    import java.io.IOException;
    import org.apache.tika.Tika;

    File inputFile = ...
    // detect(File) looks at the file's content and name; it throws IOException if the file can't be read
    String type = new Tika().detect(inputFile);
    System.out.println(type);

For an XML file I got 'application/xml'.

For a properties file I got 'text/plain'.

You can, however, supply your own Detector to new Tika().

    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-core</artifactId>
        <version>1.xx</version>
    </dependency>
rjdkolb answered Oct 20 '22 20:10



For those who do not need very precise detection, there is Java 7's Files.probeContentType method mentioned by rjdkolb:

    Path filePath = Paths.get("/path/to/your/file.jpg");
    // Returns null if the type can't be determined; throws IOException on I/O errors
    String contentType = Files.probeContentType(filePath);
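
Because probeContentType may return null when the type cannot be determined, and declares IOException, a small wrapper can be handy. The following is a minimal sketch; ProbeExample, probeOrDefault, and the fallback value are hypothetical names picked for illustration, and the result still depends on the platform's installed detectors:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ProbeExample {

    /**
     * Returns the probed content type, or the given fallback when the
     * platform cannot determine one or the probe fails with an I/O error.
     */
    static String probeOrDefault(Path path, String fallback) {
        try {
            String contentType = Files.probeContentType(path); // may be null
            return contentType != null ? contentType : fallback;
        } catch (IOException e) {
            return fallback;
        }
    }

    public static void main(String[] args) {
        Path filePath = Paths.get(args.length > 0 ? args[0] : "/path/to/your/file.jpg");
        // The printed value is platform-dependent, so none is shown here.
        System.out.println(probeOrDefault(filePath, "application/octet-stream"));
    }
}
```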
kazy answered Oct 20 '22 20:10
