 

Searching over documents stored in Hadoop - which tool to use?

I'm lost in: Hadoop, Hbase, Lucene, Carrot2, Cloudera, Tika, ZooKeeper, Solr, Katta, Cascading, POI...

When you read about any one of them, you can be fairly sure that each of the other tools is going to be mentioned.

I don't expect you to explain every tool to me - of course not. If you could help me narrow this set down for my particular scenario, that would be great. So far I'm not sure which of the above will fit, and it looks like (as always) there is more than one way of doing what needs to be done.

The scenario is: 500 GB - ~20 TB of documents stored in Hadoop. Text documents in multiple formats: email, doc, pdf, odt. Metadata about those documents is stored in an SQL db (sender, recipients, date, department etc.). The main source of documents will be an Exchange Server (emails and attachments), but not only that.

Now to the search: the user needs to be able to do complex full-text searches over those documents. Basically he'll be presented with a search-config panel (a Java desktop application, not a webapp) - he'll set a date range, document types, senders/recipients, keywords etc., fire the search, and get the resulting list of documents (and, for each document, info on why it is included in the search results, i.e. which keywords were found in the document).

Which tools should I take into consideration, and which not? The point is to develop such a solution with only the minimal required "glue" code. I'm proficient with SQL databases but quite uncomfortable with Apache and related technologies.

The basic workflow looks like this: Exchange Server/other source -> conversion from doc/pdf/... -> deduplication -> Hadoop + SQL (metadata) -> build/update an index <- search through the docs (and do it fast) -> present search results

Thank you!

asked Jul 18 '12 by garret


1 Answer

Going with Solr is a good option. I have used it for a scenario similar to the one you described above. You can use Solr for really huge data sets, as it's a distributed index server.

But to get the metadata out of all of these document formats you should use some other tool. Basically your workflow will be this:

1) Use the Hadoop cluster to store the data.

2) Extract the data in the Hadoop cluster using map/reduce.

3) Do document identification (identify the document type).

4) Extract the metadata from these documents (a sketch of steps 3 and 4 follows below).

5) Index the metadata in the Solr server, and store the other ingestion information in the database.

6) The Solr server is a distributed index server, so for each ingestion you could create a new shard or index.

7) When a search is required, search across all the indexes.

8) Solr supports all the complex searches, so you don't have to build your own search engine (a query sketch follows below).

9) It also does paging for you as well.
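
For steps 3 and 4, Apache Tika (already on your list) handles both the type detection and the text/metadata extraction. Below is a minimal sketch of how that could look; the DocumentExtractor class and the ExtractedDoc holder are hypothetical names of mine, and which metadata fields you actually get back depends on the input format.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

import org.apache.tika.metadata.HttpHeaders;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.metadata.TikaCoreProperties;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

// Hypothetical helper: detect a document's type and pull out its plain text
// and metadata with Apache Tika (steps 3 and 4 of the workflow above).
public class DocumentExtractor {

    public static ExtractedDoc extract(Path file) throws Exception {
        AutoDetectParser parser = new AutoDetectParser();
        BodyContentHandler handler = new BodyContentHandler(-1); // -1 = no write limit
        Metadata metadata = new Metadata();

        try (InputStream in = Files.newInputStream(file)) {
            parser.parse(in, handler, metadata); // detects the format, then parses it
        }

        ExtractedDoc doc = new ExtractedDoc();
        doc.contentType = metadata.get(HttpHeaders.CONTENT_TYPE);  // e.g. application/pdf
        doc.title = metadata.get(TikaCoreProperties.TITLE);        // may be null for some formats
        doc.body = handler.toString();                             // plain text to feed the index
        return doc;
    }

    // Minimal value object for the fields we care about here.
    public static class ExtractedDoc {
        public String contentType;
        public String title;
        public String body;
    }
}
```

The same extraction code can run inside your map/reduce jobs (step 2) or in a standalone ingestion process; Tika doesn't care where it is called from.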
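
For steps 7-9, your Java desktop client can talk to Solr through SolrJ. The sketch below is only a guess at how a search from your config panel could be assembled: free-text keywords, filter queries for the date range / document type / sender, highlighting so you can show why a document matched, and paging via start/rows. The field names (body, date, doctype, sender) and the core URL are assumptions based on your description, not anything Solr defines; the client class used here is the newer HttpSolrClient, while the Solr 3.x/4.x releases current in 2012 used HttpSolrServer instead.

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

// Hypothetical search call for the desktop client (steps 7-9).
public class SearchClient {

    public static void search(String keywords) throws Exception {
        try (HttpSolrClient solr =
                 new HttpSolrClient.Builder("http://localhost:8983/solr/documents").build()) {

            SolrQuery query = new SolrQuery(keywords);          // full-text part of the search
            // Filters built from the search-config panel (field names are assumptions):
            query.addFilterQuery("date:[2012-01-01T00:00:00Z TO 2012-12-31T23:59:59Z]");
            query.addFilterQuery("doctype:pdf OR doctype:email");
            query.addFilterQuery("sender:alice@example.com");

            query.setHighlight(true);                           // show *why* a document matched
            query.addHighlightField("body");

            query.setStart(0);                                  // paging: first page,
            query.setRows(20);                                  // 20 hits per page

            QueryResponse response = solr.query(query);
            for (SolrDocument doc : response.getResults()) {
                String id = (String) doc.getFieldValue("id");
                // Highlighting snippets are keyed by document id, then by field name.
                System.out.println(id + " -> " + response.getHighlighting().get(id));
            }
        }
    }
}
```

If you go with one shard or core per ingestion (step 6), the same query can be sent across all of them with Solr's distributed search, so the client code stays the same.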

answered Sep 27 '22 by Animesh Raj Jha