I'm using Apache Tika to determine a file's type from its content. XML files are detected correctly, but JSON files are not: for JSON content, Tika returns "text/plain" instead of "application/json".
Any help?
public static String tiKaDetectMimeType(final File file) throws IOException {
    TikaInputStream tikaIS = null;
    try {
        tikaIS = TikaInputStream.get(file);
        final Metadata metadata = new Metadata();
        return DETECTOR.detect(tikaIS, metadata).toString();
    } finally {
        if (tikaIS != null) {
            tikaIS.close();
        }
    }
}
JSON is based on plain text, so it's not altogether surprising that Tika reported it as such when given only the bytes to work with.
Your problem is that you didn't also supply the filename, so Tika had no name hint to work with. If you had, Tika could have combined the two: bytes = plain text + filename = .json => JSON, and given you the answer you expected.
The line you're missing is the one below (in Tika 2.x the same constant is exposed as TikaCoreProperties.RESOURCE_NAME_KEY):
metadata.set(Metadata.RESOURCE_NAME_KEY, filename);
So the fixed code snippet would be:
    tikaIS = TikaInputStream.get(file);
    final Metadata metadata = new Metadata();
    metadata.set(Metadata.RESOURCE_NAME_KEY, file.getName());
    return DETECTOR.detect(tikaIS, metadata).toString();
With that, you'll get back "application/json", as you were expecting.
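To make the "bytes + filename" reasoning concrete, here is a toy model in plain Java. It is only a sketch of the idea and not Tika's actual implementation: the class DetectionSketch, its two lookup tables, and the rule that a filename hint is accepted only when it specializes the byte-based guess are all assumptions made for illustration.

```java
import java.util.Locale;
import java.util.Map;

/** Toy model of content + filename detection. NOT Tika's real code. */
final class DetectionSketch {

    // Hypothetical type hierarchy (child -> parent): JSON is a kind of plain text.
    private static final Map<String, String> PARENT =
            Map.of("application/json", "text/plain");

    // Hypothetical filename-extension hints.
    private static final Map<String, String> BY_EXTENSION =
            Map.of("json", "application/json", "png", "image/png");

    /** Guess from bytes alone: only printable ASCII and whitespace means text/plain. */
    static String detectFromBytes(byte[] content) {
        for (byte b : content) {
            boolean printable = b >= 0x20 && b < 0x7f;
            boolean whitespace = b == '\t' || b == '\n' || b == '\r';
            if (!printable && !whitespace) {
                return "application/octet-stream";
            }
        }
        return "text/plain";
    }

    /** Refine the byte-based guess with the filename, if one was supplied. */
    static String detect(byte[] content, String fileName) {
        String fromBytes = detectFromBytes(content);
        if (fileName == null) {
            return fromBytes; // no name hint: bytes are all we have
        }
        int dot = fileName.lastIndexOf('.');
        if (dot < 0) {
            return fromBytes; // no extension to look up
        }
        String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        String hinted = BY_EXTENSION.get(ext);
        // Accept the hint only when it is a specialization of the byte-based type.
        if (hinted != null && fromBytes.equals(PARENT.get(hinted))) {
            return hinted;
        }
        return fromBytes;
    }
}
```

Calling DetectionSketch.detect(bytes, null) on JSON text yields "text/plain", which mirrors the behaviour in the question, while passing "data.json" as the name yields "application/json". A mismatched name such as "picture.png" is ignored in this sketch, because image/png is not a specialization of text/plain.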