
Getting MimeType subtype with Apache Tika

I need to get the iana.org MediaType rather than application/zip or application/x-tika-msoffice for documents such as odt, ppt, pptx, xlsx, etc.

If you look at mimetypes.xml, each mime-type element pairs the iana.org mime-type with a "sub-class-of" parent type:

   <mime-type type="application/msword">
     <alias type="application/vnd.ms-word"/>
     ............................
     <glob pattern="*.doc"/>
     <glob pattern="*.dot"/>
     <sub-class-of type="application/x-tika-msoffice"/>
   </mime-type>

How can I get the iana.org mime-type name instead of the parent type name?

When testing mime type detection, I do:

MediaType mediaType = MediaType.parse(tika.detect(inputStream));
String mimeType = mediaType.getSubtype();

Test results:

FAILED: getsCorrectContentType("application/vnd.ms-excel", docs/xls/en.xls)
java.lang.AssertionError: expected:<application/vnd.ms-excel> but was:<x-tika-msoffice>

FAILED: getsCorrectContentType("vnd.openxmlformats-officedocument.spreadsheetml.sheet", docs/xlsx/en.xlsx)
java.lang.AssertionError: expected:<vnd.openxmlformats-officedocument.spreadsheetml.sheet> but was:<zip>

FAILED: getsCorrectContentType("application/msword", doc/en.doc)
java.lang.AssertionError: expected:<application/msword> but was:<x-tika-msoffice>

FAILED: getsCorrectContentType("application/vnd.openxmlformats-officedocument.wordprocessingml.document", docs/docx/en.docx)
java.lang.AssertionError: expected:<application/vnd.openxmlformats-officedocument.wordprocessingml.document> but was:<zip>

FAILED: getsCorrectContentType("vnd.ms-powerpoint", docs/ppt/en.ppt)
java.lang.AssertionError: expected:<vnd.ms-powerpoint> but was:<x-tika-msoffice>

Is there any way to get the actual subtype from mimetypes.xml, instead of x-tika-msoffice or application/zip?

Moreover, I never get application/x-tika-ooxml for xlsx, docx and pptx documents, only application/zip.

lisak asked Aug 21 '11 10:08



2 Answers

Originally, Tika only supported detection by Mime Magic or by file extension (glob), as this is all that most mime detection before Tika did.

Because of the problems with Mime Magic and globs when it comes to detecting container formats, it was decided to add some new detectors to Tika to handle these. The Container Aware Detectors took the whole file, opened and processed the container, and then worked out the exact file type based on the contents. Initially, you needed to call them explicitly, but then they were wrapped up in ContainerAwareDetector which you'll see in some of the answers.
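
To make the limitation concrete, here is a minimal Java sketch of magic-only detection using just the standard library. The class MagicSniffer and its sniff method are made up for this illustration and are not part of Tika; the returned type names simply mirror the ones from the question. A docx, xlsx or pptx file starts with the same bytes as any other zip, and a doc, xls or ppt file starts with the same OLE2 header, so the leading bytes alone can only identify the container.

```java
import java.util.Arrays;

// Hypothetical helper for illustration; not a Tika class.
final class MagicSniffer {
    // Leading bytes of a non-empty ZIP archive ("PK\003\004"), shared by docx/xlsx/pptx.
    private static final byte[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};
    // Leading bytes of an OLE2 compound file, shared by doc/xls/ppt.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    /** Returns the container type implied by the first bytes of a file. */
    static String sniff(byte[] header) {
        if (startsWith(header, OLE2_MAGIC)) {
            return "application/x-tika-msoffice";
        }
        if (startsWith(header, ZIP_MAGIC)) {
            return "application/zip";
        }
        return "application/octet-stream";
    }

    // True if data begins with exactly the bytes in prefix.
    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }
}
```

Whatever follows the header is never looked at here, which is why the results in the question stop at the parent types.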

Since then, Tika has added a service loader pattern, initially for Parsers. This allowed classes to be auto-loaded when present, with a general way to identify which ones were appropriate and use those. This support was then extended to cover Detectors too, at which point the old ContainerAwareDetector could be removed in favour of something cleaner.

If you're on Tika 1.2 or later, and you want accurate detection of all formats, including container formats, you want to do something like:

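 // The default config's detector picks up every Detector found on the classpath
 TikaConfig config = TikaConfig.getDefaultConfig();
 Detector detector = config.getDetector();

 // A TikaInputStream lets the container detectors look at the whole file
 TikaInputStream stream = TikaInputStream.get(fileOrStream);

 // The file name is a hint for glob-based detection
 Metadata metadata = new Metadata();
 metadata.add(Metadata.RESOURCE_NAME_KEY, filenameWithExtension);
 MediaType mediaType = detector.detect(stream, metadata);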

If you run this with only the Core Tika jar (tika-core-1.2-....), then the only detector present will be the mime magic one, and you'll get the old-style detection based on magic + glob only. However, if you run it with both the Core and Parser Tika jars (plus their dependencies), or from Tika App (which includes core + parsers + dependencies automatically), then the DefaultDetector will use all the detectors that are available, including the various Container Detectors, to process your file. If your file is zip based, then detection will include processing the zip structure to identify the file type based on what's in there. This will give you the high-accuracy detection you're after, without needing to call lots of different parsers in turn.
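
What "processing the zip structure" can look like is sketched below in plain Java, without Tika. This is an illustration only and not how Tika's detectors are implemented: the class ZipPeek and its methods are hypothetical, and it only knows the three OOXML main-part content types. It reads the [Content_Types].xml entry, which an OOXML package uses to declare the content type of its main part, and falls back to a generic answer otherwise.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper for illustration; not a Tika class.
final class ZipPeek {
    private static final String OOXML = "application/vnd.openxmlformats-officedocument.";

    /** Returns a specific OOXML type if the zip declares one, else a generic type. */
    static String detect(InputStream in) {
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.getName().equals("[Content_Types].xml")) {
                    return fromContentTypes(new String(readEntry(zip), StandardCharsets.UTF_8));
                }
            }
            return "application/zip"; // a zip, but not an OOXML package
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Maps the declared main-part content type to the document's media type.
    private static String fromContentTypes(String xml) {
        String[] kinds = {
            "wordprocessingml.document", "spreadsheetml.sheet", "presentationml.presentation"
        };
        for (String kind : kinds) {
            if (xml.contains(OOXML + kind + ".main+xml")) {
                return OOXML + kind;
            }
        }
        return "application/x-tika-ooxml"; // OOXML package of a kind this sketch does not know
    }

    // Reads the current zip entry to its end.
    private static byte[] readEntry(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }

    /** Builds a one-entry zip in memory, only so the sketch can be tried out. */
    static InputStream zipOf(String name, String content) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry(name));
            zip.write(content.getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new ByteArrayInputStream(bytes.toByteArray());
    }
}
```

Calling ZipPeek.detect on a zip whose [Content_Types].xml mentions the spreadsheetml main part yields the xlsx type, while a zip without that entry stays application/zip. With Tika itself you would not write this by hand; having the parser jars on the classpath is enough.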

Gagravarr answered Sep 28 '22 11:09



For anyone else having a similar problem but using a newer Tika version, this should do the trick:

  1. Use ZipContainerDetector, since you may no longer have ContainerAwareDetector.
  2. Pass a TikaInputStream to the detector's detect() method so that Tika can determine the correct mime type.

My example code looks like this:

// Cached composite detector, built on first use
private static Detector detector;

public static String getMimeType(final Document p_document)
{
    try
    {
        Metadata metadata = new Metadata();
        // The file name lets glob-based detection contribute as well
        metadata.add(Metadata.RESOURCE_NAME_KEY, p_document.getDocName());

        Detector detector = getDefaultDetector();

        LogMF.debug(log, "Trying to detect mime type with detector {0}.", detector);
        TikaInputStream inputStream = TikaInputStream.get(p_document.getData(), metadata);

        return detector.detect(inputStream, metadata).toString();
    }
    catch (Exception e)
    {
        // Pass the exception along so the cause ends up in the log
        log.error("Error while determining mime-type of " + p_document, e);
    }

    return null;
}

private static Detector getDefaultDetector()
{
    if (detector == null)
    {
        List<Detector> detectors = new ArrayList<>();

        // zip compressed container types
        detectors.add(new ZipContainerDetector());
        // Microsoft stuff
        detectors.add(new POIFSContainerDetector());
        // mime magic detection as fallback
        detectors.add(MimeTypes.getDefaultMimeTypes());

        detector = new CompositeDetector(detectors);
    }

    return detector;
}

Note that the Document class is part of my domain model, so you will have something similar of your own in its place.

I hope that someone can use this.

Sebastian Götz answered Sep 28 '22 12:09
