I'm using Apache Tika to detect a file's MIME type from its Base64 representation. Unfortunately, I don't have any other information about the file (e.g. its extension).
Is there something I can do to make Tika be more specific?
I'm currently using this:
Tika tika = new Tika();
tika.setMaxStringLength(-1);
String mimetype = tika.detect(Base64.decode(fileString));
and it gives me text/plain for JSON and PDF files, but I would like to get something more specific: application/json, application/pdf, etc.
Hope someone can help me!
Thanks.
Tika#detect(String)
Detects the media type of a document with the given file name.
Passing the content of a PDF or JSON file won't work because this method expects a file name. Tika will fall back to text/plain because it won't find any matching file names.
For PDF, you just need to give Tika some of the actual data, either as a stream or as a byte array. Tika then uses MIME magic detection, which looks for special ("magic") patterns of bytes near the start of the file (for PDF, in plain text, that's %PDF):
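import org.apache.tika.Tika;

// Stand-in for the Base64-decoded file content; only the leading %PDF signature matters here
String pdfContent = "%PDF-1.4\n%\\E2\\E3\\CF\\D3";
Tika tika = new Tika();
System.out.println(tika.detect(pdfContent.getBytes())); // "application/pdf"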
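To make the idea of magic detection concrete, here is a minimal sketch that uses only the standard library. It is not Tika's implementation, which knows far more signatures and edge cases. MagicSniffer and sniff are hypothetical names invented for this illustration. It decodes the Base64 string and checks whether the bytes start with the PDF signature %PDF-:

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper, not part of Tika: a hand-rolled check for the PDF signature.
final class MagicSniffer {
    // PDF files begin with the ASCII bytes "%PDF-"
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /** Returns "application/pdf" if the decoded bytes start with %PDF-, otherwise null. */
    static String sniff(String base64) {
        // Throws IllegalArgumentException if the input is not valid Base64
        byte[] data = Base64.getDecoder().decode(base64);
        if (data.length < PDF_MAGIC.length) {
            return null;
        }
        for (int i = 0; i < PDF_MAGIC.length; i++) {
            if (data[i] != PDF_MAGIC[i]) {
                return null;
            }
        }
        return "application/pdf";
    }
}
```

The important point is the same as with Tika: the check runs on the decoded bytes, never on the Base64 text or on a file name.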
JSON
For JSON, though, even this method will return text/plain, and Tika is correct: application/json is like a subtype of plain text that indicates the text should be interpreted differently. So when you get text/plain, you have to do that interpretation yourself. Use a JSON library (e.g. Jackson) to parse the content and see whether it's valid JSON:
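import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.databind.ObjectMapper;

String json = "[1, 2, 3]"; // an array in JSON
try {
    final JsonParser parser = new ObjectMapper().getFactory().createParser(json);
    // Walk every token; malformed input throws before the end of the stream is reached
    while (parser.nextToken() != null) {
    }
    System.out.println("Probably JSON!");
} catch (Exception e) {
    System.out.println("Definitely not JSON!");
}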
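Just be careful about how strict you want to be, since Jackson treats a single number such as 1 as valid JSON. That agrees with the current JSON specification (RFC 8259), but the older RFC 4627 only allowed an object or an array at the top level, and a bare scalar is rarely what you mean by "a JSON file". To get round that, you could first test that the string starts with either { or [ (possibly preceded by whitespace) using something like json.matches("(?s)^\\s*[{\\[].*") before even attempting to parse it as JSON. The (?s) flag lets . match line breaks, so multi-line documents pass the check too.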
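As a rough sketch of that pre-check using only java.util.regex (JsonShape and looksLikeObjectOrArray are hypothetical names, not part of Tika or Jackson), the test can be anchored at the start of the input so the rest of the document never needs to match:

```java
import java.util.regex.Pattern;

// Hypothetical helper: cheap shape check to run before handing the text to a real JSON parser.
final class JsonShape {
    // Optional leading whitespace (including newlines), then '{' or '['.
    // Without MULTILINE, '^' only matches at the very start of the input.
    private static final Pattern STARTS_LIKE_CONTAINER = Pattern.compile("^\\s*[{\\[]");

    /** True if the text begins like a JSON object or array; says nothing about validity. */
    static boolean looksLikeObjectOrArray(String text) {
        return STARTS_LIKE_CONTAINER.matcher(text).find();
    }
}
```

A true result only means the text is worth parsing; the parser still has the final say on whether it is valid JSON.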
Here's a DZone tutorial for Jackson.
In a past project I used TikaConfig.
This is what I did:
// Note: you can also use a byte[] instead of an InputStream
InputStream is = new FileInputStream(new File(YOUR_FILE));
TikaConfig tc = new TikaConfig();
Metadata md = new Metadata();
md.set(Metadata.RESOURCE_NAME_KEY, fileName);
String mimeType = tc.getDetector().detect(TikaInputStream.get(is), md).toString();
Using a byte[]:
byte[] fileBytes = GET_BYTE_ARRAY_FROM_YOUR_FILE;
TikaConfig tc = new TikaConfig();
Metadata md = new Metadata();
md.set(Metadata.RESOURCE_NAME_KEY, fileName);
String mimeType = tc.getDetector().detect(TikaInputStream.get(fileBytes), md).toString();
I had no issues getting the right MIME type this way.
I hope this is useful.
Angelo