I am looking at a parser for pdf and MS office document formats to extract tabular information from files. Was thinking of writing separate implementations when I saw Apache Tika. I am able to extract full text from any of these file formats. But my requirement is to extract tabular data where I am expecting 2 columns in a key value format. I checked most of the stuff available in the net for a solution but could not find any. Any pointers for this?
Apache Tika is a content type detection and content extraction framework. Tika provides a general application programming interface that can be used to detect the content type of a document and also parse textual content and metadata from several document formats.
Apache Tika is a library that is used for document type detection and content extraction from various file formats. Internally, Tika uses existing various document parsers and document type detection techniques to detect and extract data.
To extract Microsoft office files such as xls file, Tika provides OOXMLParser class. This class is used to extract content and metadata from the Microsoft files.
tika. parser. Parser interface is the key concept of Apache Tika. It hides the complexity of different file formats and parsing libraries while providing a simple and powerful mechanism for client applications to extract structured text content and metadata from all sorts of documents.
Well I went ahead and implemented it separately using apache poi for the MS formats. I came back to Tika for PDF. What Tika does with the docs is that it will output it as "SAX based XHTML events"1
So basically we can write a custom SAX implementation to parse the file.
The structure text output will be of the form (Meta details avoided)
<body><div class="page"><p/>
<p>Key1 Value1 </p>
<p>Key2 Value2 </p>
<p>Key3 Value3</p>
<p/>
</div>
</body>
In our SAX implementation we can consider the first part as key (for my problem I already know the key and I am looking for values, so it is a substring).
Override public void characters(char[] ch, int start, int length) with the logic
Please note for my case the structure of the content is fixed and I know the keys that are coming in, so it was easy doing it this way. This is not a generic solution
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With