Is it possible to extract table information using Apache Tika?

Tags: java, apache-tika

I am looking for a parser for PDF and MS Office document formats that can extract tabular information from files. I was thinking of writing separate implementations for each format when I came across Apache Tika. With it I am able to extract the full text from any of these file formats, but my requirement is to extract tabular data, where I expect 2 columns in a key-value format. I checked most of what is available on the net for a solution but could not find any. Any pointers for this?

asked Nov 22 '12 16:11

rajesh

1 Answer

Well, I went ahead and implemented it separately using Apache POI for the MS Office formats, and came back to Tika for PDF. Tika outputs the parsed document as "SAX based XHTML events".

So basically we can write a custom SAX handler (a ContentHandler implementation) and pass it to Tika to process those events.
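A minimal sketch of such a handler, assuming the paragraph-per-row layout shown in this answer. The class name ParagraphCollector and the collect helper are made up for illustration. To keep the example self-contained it feeds an XHTML string through the JDK's own SAX parser as a stand-in for Tika. With Tika, the same handler object would instead be passed as the ContentHandler argument of the parser's parse method.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects the trimmed text of every non-empty <p> element.
class ParagraphCollector extends DefaultHandler {
    private final List<String> paragraphs = new ArrayList<>();
    private final StringBuilder buffer = new StringBuilder();
    private boolean inParagraph;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (isParagraph(localName, qName)) {
            inParagraph = true;
            buffer.setLength(0);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // SAX may deliver one text node in several chunks, so only buffer here.
        if (inParagraph) {
            buffer.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (isParagraph(localName, qName)) {
            String text = buffer.toString().trim();
            if (!text.isEmpty()) {
                paragraphs.add(text); // empty <p/> elements are skipped
            }
            inParagraph = false;
        }
    }

    // The element name arrives in localName or qName depending on namespace settings.
    private static boolean isParagraph(String localName, String qName) {
        return "p".equals(localName) || "p".equals(qName);
    }

    List<String> paragraphs() {
        return paragraphs;
    }

    // Stand-in for Tika: runs the handler over an XHTML string with the JDK parser.
    static List<String> collect(String xhtml) throws Exception {
        ParagraphCollector handler = new ParagraphCollector();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.paragraphs();
    }
}
```

Buffering in characters and flushing in endElement is a deliberate choice here: the SAX contract allows a single run of text to arrive in more than one characters call.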

The structured text output will be of the form (metadata omitted):

<body><div class="page"><p/>
<p>Key1 Value1 </p>
<p>Key2 Value2 </p>
<p>Key3 Value3</p>
<p/>
</div>
</body>

In our SAX implementation we can treat the first part of each paragraph as the key and the rest as the value. (For my problem I already know the keys and am only looking for the values, so the value is just the substring after the key.)

Override public void characters(char[] ch, int start, int length) with that logic.
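One way that logic might look, as a sketch rather than the exact code I used. The names KnownKeyHandler and extract are hypothetical, and the keys are assumed to be followed by a single space, as in the sample output above. Text is buffered in characters and matched against the known keys when an element closes, since characters alone may only see part of a row. The JDK SAX parser again stands in for Tika so that the example runs on its own.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: looks up the values for a fixed list of known keys.
class KnownKeyHandler extends DefaultHandler {
    private final List<String> knownKeys;
    private final Map<String, String> values = new LinkedHashMap<>();
    private final StringBuilder buffer = new StringBuilder();

    KnownKeyHandler(List<String> knownKeys) {
        this.knownKeys = knownKeys;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Text can arrive in pieces, so accumulate it until the element ends.
        buffer.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // An element just closed: treat the buffered text as one candidate row.
        String line = buffer.toString().trim();
        buffer.setLength(0);
        for (String key : knownKeys) {
            // Require a space after the key so "Key1" does not match "Key10 ...".
            if (line.startsWith(key + " ")) {
                values.put(key, line.substring(key.length()).trim());
                break;
            }
        }
    }

    Map<String, String> values() {
        return values;
    }

    // Stand-in for Tika: parses an XHTML string with the JDK SAX parser.
    static Map<String, String> extract(String xhtml, List<String> keys) throws Exception {
        KnownKeyHandler handler = new KnownKeyHandler(keys);
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.values();
    }
}
```

Called with the sample output above and the keys Key1 and Key3, extract returns a map holding Value1 and Value3. Rows whose key is not in the list are ignored.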

Please note that in my case the structure of the content is fixed and I know which keys are coming in, so it was easy to do it this way. This is not a generic solution.

answered Sep 22 '22 14:09

rajesh
