I had requirement to extract specific colums/rows from Excel/CSV file. Somebody suggest me to using Tika for this task.
While going thru tika, I came across POI API and found more friendly to use it.
we may have requirement to parse PDF file in further.
I am new to this technology, i would like know difference between two and which technology is more suitable for my requirement.
Thanks, Krishna
Apache POI is an API provided by Apache foundation which is a collection of different java libraries. These libraries gives the facility to read, write and manipulate different Microsoft files such as excel sheet, power-point, and word files.
The Apache Tika™ toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF). All of these file types can be parsed through a single interface, making Tika useful for search engine indexing, content analysis, translation, and much more.
Apache Tika is an open source tool that extracts metadata and text from over a thousand different file types, for example, PPT, XLS, and PDF. You can parse the file types through a single interface, which makes Tika useful for search engine indexing, content analysis, conversion, and more.
Apache Tika is a library that is used for document type detection and content extraction from various file formats.
Apache Tika provides a common way to extract consistent text and metadata from a wide range of formats. It also provides content detection, language detection and a few other bits. If you write your code to work with Apache Tika, then your code will be able to work with a huge range of formats in the same way. You don't need to worry about whether one format has a Title, or another calls the same logical thing a LongTitle or a Subject. You don't need to worry about what library to use for what format. You call Tika, it does the hard work for you, and back comes your consistent Metadata and Textual Content
Apache POI is one of the libraries that Tika uses. POI supports most of the main Microsoft Office formats, including Excel (.xls and .xlsx). It provides access to the whole of the file format, allowing you complete control over what information you read out. (It also supports writing). Tika uses POI to get text and metadata out of the various different Microsoft formats, but doesn't extract everything. Using POI directly would allow you to decide what you care about and get that.
If you want to support lots of file formats, use Tika. If you want full control of how you get the information out, use POI.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With