Is it possible to extract table information using Apache Tika?

I am looking for a parser for PDF and MS Office document formats that can extract tabular information from files. I was thinking of writing separate implementations for each format when I came across Apache Tika. With it I am able to extract the full text from any of these file formats, but my requirement is to extract tabular data, where I expect two columns in key-value form. I checked most of what is available on the net but could not find a solution. Any pointers?

rajesh asked Nov 22 '12 16:11


1 Answer

Well, I went ahead and implemented it separately using Apache POI for the MS Office formats, and came back to Tika for PDF. Tika emits the parsed document as "SAX-based XHTML events".

So basically we can write a custom SAX handler to process the parsed output.

The structured text output will be of the following form (metadata omitted):

<body><div class="page"><p/>
<p>Key1 Value1 </p>
<p>Key2 Value2 </p>
<p>Key3 Value3</p>
<p/>
</div>
</body>

In our SAX implementation we can treat the first part of each paragraph as the key. For my problem I already know the keys and am only looking for the values, so the value is simply the substring that follows the key.

Override public void characters(char[] ch, int start, int length) and put that logic there.

Please note that in my case the structure of the content is fixed and I know which keys to expect, so it was easy to do it this way. This is not a generic solution.
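The handler described above might look roughly like the sketch below. It is an illustration under assumptions, not the code I actually used: the class names KeyValueHandler and KeyValueDemo, the extract helper, and the key list are all made up for the example. To keep it runnable without extra dependencies it feeds the sample XHTML to the JDK's own SAX parser; with Tika, the same handler would instead be passed to the parser's parse(InputStream, ContentHandler, Metadata, ParseContext) method. One deliberate difference from "do it all in characters": SAX is allowed to deliver the text of one paragraph in several chunks, so characters only buffers the text and the key matching happens in endElement.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: picks up "<p>Key Value</p>" paragraphs for a known set of keys.
class KeyValueHandler extends DefaultHandler {
    private final List<String> knownKeys;
    private final Map<String, String> values = new LinkedHashMap<>();
    private final StringBuilder text = new StringBuilder();
    private boolean inParagraph = false;

    KeyValueHandler(List<String> knownKeys) {
        this.knownKeys = knownKeys;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("p".equals(name(localName, qName))) {
            inParagraph = true;
            text.setLength(0); // start a fresh buffer for this paragraph
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called more than once per paragraph, so only accumulate here.
        if (inParagraph) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (!"p".equals(name(localName, qName))) {
            return;
        }
        inParagraph = false;
        String line = text.toString().trim();
        for (String key : knownKeys) {
            // The key is known up front; the value is the substring after it.
            if (line.startsWith(key + " ")) {
                values.put(key, line.substring(key.length()).trim());
                break;
            }
        }
    }

    // Non-namespace-aware parsers leave localName empty, so fall back to qName.
    private static String name(String localName, String qName) {
        return (localName == null || localName.isEmpty()) ? qName : localName;
    }

    Map<String, String> getValues() {
        return values;
    }
}

class KeyValueDemo {
    // Parses an XHTML string with the JDK SAX parser and returns the matched pairs.
    static Map<String, String> extract(String xhtml, List<String> keys) {
        try {
            KeyValueHandler handler = new KeyValueHandler(keys);
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.getValues();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XHTML", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<body><div class=\"page\"><p/>"
                + "<p>Key1 Value1 </p><p>Key2 Value2 </p><p>Key3 Value3</p>"
                + "<p/></div></body>";
        // prints {Key1=Value1, Key2=Value2, Key3=Value3}
        System.out.println(extract(xhtml, List.of("Key1", "Key2", "Key3")));
    }
}
```

Matching on the key followed by a space keeps "Key1" from also matching a paragraph that starts with "Key10". As with the approach itself, this only works when each pair arrives as one paragraph with the key first.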

rajesh answered Sep 22 '22 14:09