
Apache Tika and document metadata

I'm doing simple processing of a variety of documents (ODS, MS Office, PDF) using Apache Tika. I have to get at least:

word count, author, title, timestamps, language etc.

This is not so easy. My strategy is to use the Template Method pattern for 6 document types: I detect the type of the document first and then process it individually based on that type.

I know that Apache Tika should remove the need for this, but the document formats are quite different, right?

For instance

InputStream input = this.getClass().getClassLoader().getResourceAsStream(doc);
ContentHandler textHandler = new BodyContentHandler();
Metadata metadata = new Metadata();
Parser parser = new OfficeParser();
parser.parse(input, textHandler, metadata, new ParseContext());
input.close();

for(String s : metadata.names()) {
    System.out.println("Metadata name : "  + s);
}

I tried this for ODS, MS Office, and PDF documents, and the metadata differs a lot. There is an MSOffice interface that lists metadata keys for MS documents, and a Dublin Core metadata list. But how should one implement an application like this?

Could anybody who has experience with this please share it? Thank you.

lisak asked Feb 26 '11 21:02





1 Answer

Generally the parsers should return the same metadata key for the same kind of thing across all document formats. However, there are some kinds of metadata that only occur in some file types, so you won't get those from others.
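One way to cope with keys that differ or are missing is to copy the Tika metadata into a plain map and look each field up under several candidate names. This is only a minimal sketch in plain Java with no Tika dependency. The class MetadataFields, the method firstPresent, and the alias lists are hypothetical, and the key names are examples whose exact spelling depends on the Tika version, so check the output of metadata.names() on your own documents first.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

public class MetadataFields {

    // Hypothetical alias lists: candidate key names for one logical field.
    // The real keys depend on the Tika version and the document format.
    public static final List<String> TITLE_KEYS = Arrays.asList("dc:title", "title");
    public static final List<String> AUTHOR_KEYS = Arrays.asList("dc:creator", "meta:author", "Author");

    // Returns the value of the first alias that is present and non-blank.
    public static Optional<String> firstPresent(Map<String, String> metadata, List<String> aliases) {
        for (String key : aliases) {
            String value = metadata.get(key);
            if (value != null && !value.trim().isEmpty()) {
                return Optional.of(value.trim());
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Stand-in for values copied out of a Tika Metadata object.
        Map<String, String> metadata = new HashMap<>();
        metadata.put("Author", "lisak");
        metadata.put("title", "  ");

        System.out.println(firstPresent(metadata, AUTHOR_KEYS).orElse("(unknown)"));
        System.out.println(firstPresent(metadata, TITLE_KEYS).orElse("(unknown)"));
    }
}
```

A field that no format supplies simply comes back empty, so the caller can decide whether to skip it or compute it another way.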

You might want to just use the AutoDetectParser and, if you need to do anything special with the metadata, handle that afterwards based on the MIME type, e.g.

// input and textHandler are set up as in the question's snippet
Metadata metadata = new Metadata();
metadata.set(Metadata.RESOURCE_NAME_KEY, filename);
ParseContext context = new ParseContext();

Parser parser = new AutoDetectParser();
parser.parse(input, textHandler, metadata, context);

// Constant-first equals avoids a NullPointerException if no type was detected
if ("application/pdf".equals(metadata.get(Metadata.CONTENT_TYPE))) {
   // Do something special with the PDF metadata here
}
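
If several MIME types need their own post-processing, the if check can grow into a lookup table rather than one subclass per format. This is a rough sketch in plain Java, not part of Tika. The class MimeDispatch, its methods, and the "family" entry written by the example handler are all hypothetical, and the map stands in for values copied from Tika's Metadata object under its "Content-Type" key.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;

public class MimeDispatch {

    // Maps a MIME type to the format-specific step to run after parsing.
    private final Map<String, Consumer<Map<String, String>>> handlers = new HashMap<>();

    public void register(String mimeType, Consumer<Map<String, String>> handler) {
        handlers.put(mimeType, handler);
    }

    // Runs the handler for the detected type; returns false if none applies.
    public boolean postProcess(Map<String, String> metadata) {
        String contentType = metadata.get("Content-Type");
        if (contentType == null) {
            return false;
        }
        // Drop parameters such as "; charset=UTF-8" before the lookup.
        String mimeType = contentType.split(";")[0].trim();
        Consumer<Map<String, String>> handler = handlers.get(mimeType);
        if (handler == null) {
            return false;
        }
        handler.accept(metadata);
        return true;
    }

    public static void main(String[] args) {
        MimeDispatch dispatch = new MimeDispatch();
        // Hypothetical handler: tags the record so later code knows the family.
        dispatch.register("application/pdf", m -> m.put("family", "pdf"));

        Map<String, String> metadata = new HashMap<>();
        metadata.put("Content-Type", "application/pdf");
        System.out.println(dispatch.postProcess(metadata) + " " + metadata.get("family"));
    }
}
```

Formats without a registered handler fall through untouched, which matches the idea of treating the common metadata uniformly and special-casing only where needed.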
Gagravarr answered Sep 22 '22 06:09
