I'm doing simple processing of variety of documents (ODS, MS office, pdf) using Apache Tika. I have to get at least :
word count, author, title, timestamps, language etc.
which is not so easy. My strategy is using Template method pattern for 6 types of document, where I find the type of document first, and based on that I process it individually.
I know that apache tika should remove the need for this, but the document formats are quite different right ?
For instance
InputStream input = this.getClass().getClassLoader().getResourceAsStream(doc);
ContentHandler textHandler = new BodyContentHandler();
Metadata metadata = new Metadata();
Parser parser = new OfficeParser();
parser.parse(input, textHandler, metadata, new ParseContext());
input.close();
for(String s : metadata.names()) {
    System.out.println("Metadata name : "  + s);
}
I tried to do this for ODS, MS office, pdf documents, and the metadada differs a lot. There is MSOffice interface that lists metadata keys for MS documents and some Dublic Core metadata list. But how should one implement an application like this ?
Could please anybody who has experience with it share his experience ? Thank you
The Apache Tika™ toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF). All of these file types can be parsed through a single interface, making Tika useful for search engine indexing, content analysis, translation, and much more.
To extract Microsoft office files such as xls file, Tika provides OOXMLParser class. This class is used to extract content and metadata from the Microsoft files.
tika. config. ServiceLoader class provides a registry of each type of provider. This allows Tika to create implementations such as org.
Tika-Python is Python binding to the Apache TikaTM REST services allowing tika to be called natively in python language. Installation: To install Tika type the below command in the terminal. For extracting contents from the PDF files we will use from_file() method of parser object.
Generally the parsers should return the same metadata key for the same kind of thing across all document formats. However, there are some kinds of metadata that only occur in some file types, so you won't get those from others.
You might want to just use the AutoDetectParser, and if you need to do anything special with the metadata handle that afterwards based on the mimetype, eg
Metadata metadata = new Metadata();
metadata.set(Metadata.RESOURCE_NAME_KEY, filename);
ParseContext context = new ParseContext();
Parser parser = new AutoDetectParser();
parser.parse(input, textHandler, metadata, new ParseContext());
if(metadata.get(CONTENT_TYPE).equals("application/pdf")) {
   // Do something special with the PDF metadata here
}
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With