Apache Tika and document metadata

Tags: java, metadata, apache, apache-tika, documents

I'm doing simple processing of a variety of documents (ODS, MS Office, PDF) using Apache Tika. I need to extract at least:

word count, author, title, timestamps, language, etc.

which is not so easy. My strategy is to use the Template Method pattern for six document types: I first determine the type of the document, and then process it individually based on that.

I know that Apache Tika should remove the need for this, but the document formats are quite different, right?

For instance:

InputStream input = this.getClass().getClassLoader().getResourceAsStream(doc);
ContentHandler textHandler = new BodyContentHandler();
Metadata metadata = new Metadata();
Parser parser = new OfficeParser();
parser.parse(input, textHandler, metadata, new ParseContext());
input.close();

for(String s : metadata.names()) {
    System.out.println("Metadata name : "  + s);
}

I tried this for ODS, MS Office, and PDF documents, and the metadata differs a lot. There is an MSOffice interface that lists metadata keys for MS documents, and a Dublin Core metadata list. But how should one implement an application like this?

Could anybody who has experience with this please share it? Thank you.

asked Feb 26 '11 21:02

lisak

1 Answer

Generally the parsers should return the same metadata key for the same kind of thing across all document formats. However, there are some kinds of metadata that only occur in some file types, so you won't get those from others.
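One way to cope with keys that differ or are missing is to copy the parsed metadata into a plain map and look each field up under a list of candidate keys. The following is only a minimal sketch: the class `MetadataNormalizer`, its canonical field names, and the alias lists are hypothetical, and the key strings are assumptions that should be checked against what `metadata.names()` prints for your own documents and Tika version. The `countWords` fallback is likewise just an illustration for formats that carry no word-count key.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper: maps format-specific metadata keys onto canonical field names.
final class MetadataNormalizer {

    // Candidate keys per canonical field, tried in order. The key names are
    // illustrative assumptions, not an authoritative list of Tika keys.
    private static final Map<String, List<String>> ALIASES = new LinkedHashMap<>();
    static {
        ALIASES.put("title", Arrays.asList("dc:title", "title"));
        ALIASES.put("author", Arrays.asList("dc:creator", "meta:author", "Author"));
        ALIASES.put("created", Arrays.asList("dcterms:created", "Creation-Date"));
        ALIASES.put("wordCount", Arrays.asList("meta:word-count", "Word-Count"));
    }

    // Returns the first non-blank value stored under any alias of the field.
    static Optional<String> lookup(Map<String, String> raw, String field) {
        List<String> keys = ALIASES.getOrDefault(field, Collections.<String>emptyList());
        for (String key : keys) {
            String value = raw.get(key);
            if (value != null && !value.trim().isEmpty()) {
                return Optional.of(value.trim());
            }
        }
        return Optional.empty();
    }

    // Fallback when no word-count key exists: count whitespace-separated tokens
    // in the extracted body text.
    static int countWords(String text) {
        if (text == null || text.trim().isEmpty()) {
            return 0;
        }
        return text.trim().split("\\s+").length;
    }
}
```

The raw map can be filled by looping over `metadata.names()` and calling `metadata.get(name)` for each entry, so the rest of the application only ever sees the canonical field names.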

You might want to just use the AutoDetectParser and, if you need to do anything special with the metadata, handle that afterwards based on the MIME type, e.g.:

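// input and textHandler are set up as in the question's snippet
Metadata metadata = new Metadata();
// The file name is a hint that helps type detection
metadata.set(Metadata.RESOURCE_NAME_KEY, filename);
ParseContext context = new ParseContext();

Parser parser = new AutoDetectParser();
parser.parse(input, textHandler, metadata, context);

// Constant-first equals avoids a NullPointerException if no content type was set
if ("application/pdf".equals(metadata.get(Metadata.CONTENT_TYPE))) {
   // Do something special with the PDF metadata here
}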
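If more than one format needs special handling, the chain of `if` checks can be replaced by a small lookup table keyed by MIME type. This is a sketch under assumptions, not part of Tika: the class `MetadataPostProcessors` and its methods are hypothetical, and it works on a plain `Map<String, String>` copy of the metadata so that it stays independent of the Tika version in use.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical registry of per-format post-processing steps, keyed by MIME type.
final class MetadataPostProcessors {

    private final Map<String, UnaryOperator<Map<String, String>>> byType = new HashMap<>();

    // Registers a step to run for documents of the given MIME type.
    void register(String mimeType, UnaryOperator<Map<String, String>> step) {
        byType.put(baseType(mimeType), step);
    }

    // A content type may carry parameters such as "; charset=UTF-8",
    // so only the part before the first ';' is used for the lookup.
    static String baseType(String contentType) {
        if (contentType == null) {
            return "";
        }
        int semi = contentType.indexOf(';');
        String base = semi >= 0 ? contentType.substring(0, semi) : contentType;
        return base.trim().toLowerCase(Locale.ROOT);
    }

    // Applies the registered step, or returns the metadata unchanged if none matches.
    Map<String, String> apply(String contentType, Map<String, String> metadata) {
        UnaryOperator<Map<String, String>> step =
                byType.getOrDefault(baseType(contentType), UnaryOperator.identity());
        return step.apply(metadata);
    }
}
```

A PDF step would be registered once with `register("application/pdf", ...)` and invoked after parsing with the value of `metadata.get(Metadata.CONTENT_TYPE)`; formats without a registered step pass through untouched.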

answered Sep 22 '22 06:09

Gagravarr
