
Apache Tika and Json

I'm using Apache Tika to determine a file's type from its content. XML files are detected correctly, but JSON files are not: when the content is JSON, Tika returns "text/plain" instead of "application/json".

Any help?

public static String tiKaDetectMimeType(final File file) throws IOException {
    TikaInputStream tikaIS = null;
    try {
        tikaIS = TikaInputStream.get(file);
        final Metadata metadata = new Metadata();
        return DETECTOR.detect(tikaIS, metadata).toString();
    } finally {
        if (tikaIS != null) {
            tikaIS.close();
        }
    }
}
songjing asked Oct 17 '13 06:10


1 Answer

JSON is based on plain text, so it's not altogether surprising that Tika reported it as such when given only the bytes to work with.

Your problem is that you didn't also supply the filename, so Tika had nothing beyond the bytes to go on. If you had, Tika could have combined the two hints (bytes = plain text, filename = .json, therefore JSON) and given you the answer you expected.

The line you're missing is:

metadata.set(Metadata.RESOURCE_NAME_KEY, filename);

So the fixed code snippet would be:

tikaIS = TikaInputStream.get(file);
final Metadata metadata = new Metadata();
metadata.set(Metadata.RESOURCE_NAME_KEY, file.getName());
return DETECTOR.detect(tikaIS, metadata).toString();

With that, you'll get back "application/json", as you were expecting.
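To make the "bytes + filename" reasoning concrete, here is a minimal sketch of the idea using only the JDK. It is a toy illustration, not Tika's actual implementation: the class name `NameHintDetector`, its methods, and the tiny extension table are all made up for this example, and the real detector uses a much larger type registry.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.Map;

// Toy illustration only: this is NOT how Tika is implemented internally.
class NameHintDetector {

    // Hypothetical extension table, invented for this example.
    private static final Map<String, String> BY_EXTENSION = Map.of(
            "json", "application/json",
            "csv", "text/csv");

    // Content-only guess: bytes with no control characters (other than
    // tab, newline, carriage return) look like plain text.
    static String detectFromBytes(byte[] head) {
        for (byte b : head) {
            int v = b & 0xFF;
            if (v < 0x20 && v != '\t' && v != '\n' && v != '\r') {
                return "application/octet-stream";
            }
        }
        return "text/plain";
    }

    // Combined guess: when the bytes only say "plain text", let the
    // file extension (if known) refine the answer.
    static String detect(byte[] head, String resourceName) {
        String fromBytes = detectFromBytes(head);
        if (resourceName == null || !fromBytes.equals("text/plain")) {
            return fromBytes;
        }
        int dot = resourceName.lastIndexOf('.');
        if (dot < 0) {
            return fromBytes;
        }
        String ext = resourceName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return BY_EXTENSION.getOrDefault(ext, fromBytes);
    }

    public static void main(String[] args) {
        byte[] json = "{\"a\": 1}".getBytes(StandardCharsets.UTF_8);
        System.out.println(detect(json, null));        // prints text/plain
        System.out.println(detect(json, "data.json")); // prints application/json
    }
}
```

Without a name, the bytes alone can only support "text/plain"; with the name, the same bytes can be reported as the more specific type. That is the role the resource name plays in the metadata passed to the detector.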

Gagravarr answered Oct 08 '22 14:10