
Apache Tika and File access instead of Java Input Stream

I want to create a new Tika parser to extract metadata from a file. We're already using Tika, so adding a parser there would keep our metadata extraction consistent.

I think that I've run into this problem/enhancement request for Tika:

Allow passing of files or memory buffers to parsers

I have a console C++ executable that accepts the path to a file as input and then outputs the metadata that it finds, one name/value pair per line.
The C++ code relies on libraries that expect a file path when accessing the data, and it's not going to be possible to rewrite this executable in Java. I thought that it would be fairly easy to plug this into Tika. But a Tika parser has to be written in Java, and the parser method that needs to be overridden takes an open input stream rather than a file:

void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context)

So I guess that my only option is to write the input stream to a temporary file, process that file, and then clean it up. I'd rather not mess with temporary files, because then I have to worry about files that never get deleted when something goes wrong.
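For reference, the write/process/delete cycle described above can be kept fairly contained with plain JDK calls. The following is only a sketch: the class name TempFileSpool, the PathAction interface, the withTempFile method, and the "tika-spool-" prefix are all hypothetical names chosen for illustration, not part of Tika.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class TempFileSpool {
    /** Callback that receives the path of the spooled file (hypothetical helper type). */
    interface PathAction<T> {
        T apply(Path file) throws IOException;
    }

    /**
     * Copies the stream to a temporary file, runs the action on that file,
     * and deletes the file afterwards, whether the action returns or throws.
     */
    static <T> T withTempFile(InputStream stream, PathAction<T> action) throws IOException {
        // createTempFile creates an empty file, so the copy must be allowed to replace it.
        Path tmp = Files.createTempFile("tika-spool-", ".tmp");
        try {
            Files.copy(stream, tmp, StandardCopyOption.REPLACE_EXISTING);
            return action.apply(tmp);
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

Inside parse(...), the body of the action would be the place to invoke the executable on the temporary path. The finally block covers exceptions, but not a JVM that is killed partway through, which is the residual cleanup risk the question is worried about.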

Does anyone have a clever idea about how to cleanly deal with something like this?

George asked May 17 '11 at 21:05



1 Answer

There's TikaInputStream which should help. It handles wrapping a File or an InputStream, and converting between them internally as parsers require. It does all the temp file bits as needed for you.

Several Java parsers already make use of it because they need a File rather than an InputStream. What's more, callers who already have a file can wrap it in a TikaInputStream and pass that to the parser, and the parser can then read it as either a File or an InputStream, whichever suits its needs.

So, I'd suggest you just turn the InputStream into a TikaInputStream with TikaInputStream.get(stream) (which is just a cast if it's already one), then call getFile() on it and pass that file's path to your C++ executable.
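Once the parser has a File in hand, passing it to the executable and collecting the name/value lines might look roughly like the sketch below. It uses only the JDK. The class and method names are hypothetical, and the '=' separator between name and value is an assumption, since the question does not say how the pairs are delimited.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

class ExternalMetadataTool {
    /** Parses "name=value" lines; the '=' separator is an assumption. */
    static Map<String, String> parsePairs(Reader output) throws IOException {
        Map<String, String> pairs = new LinkedHashMap<>();
        BufferedReader reader = new BufferedReader(output);
        String line;
        while ((line = reader.readLine()) != null) {
            int sep = line.indexOf('=');
            if (sep <= 0) {
                continue; // skip lines with no separator or an empty name
            }
            pairs.put(line.substring(0, sep).trim(), line.substring(sep + 1).trim());
        }
        return pairs;
    }

    /** Runs the executable with the file's absolute path as its only argument. */
    static Map<String, String> run(String executable, File file)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(executable, file.getAbsolutePath())
                .redirectErrorStream(true) // merge stderr so the child cannot block on it
                .start();
        Map<String, String> pairs;
        try (Reader out = new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8)) {
            pairs = parsePairs(out);
        }
        int exit = process.waitFor();
        if (exit != 0) {
            throw new IOException(executable + " exited with code " + exit);
        }
        return pairs;
    }
}
```

Each entry of the returned map could then be copied into the Metadata object that parse(...) receives. Keeping parsePairs separate from run makes the line format easy to change or test without launching the real executable.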

Gagravarr answered Sep 27 '22 at 22:09
