
Recommendations for a spidering tool to use with Lucene or Solr? [closed]

What is a good crawler (spider) to use against HTML and XML documents (local or web-based) that works well in the Lucene / Solr solution space? It could be Java-based but does not have to be.

asked Nov 12 '08 00:11 by BuddyJoe



2 Answers

In my opinion, this is a significant gap that is holding back wider adoption of Solr. The new DataImportHandler is a good first step for importing structured data, but Solr still lacks a good document ingestion pipeline. Nutch does work, but the integration between the Nutch crawler and Solr is somewhat clumsy.
I've tried every open-source crawler I could find, and none of them integrates with Solr out of the box.
Keep an eye on OpenPipeline and Apache Tika.
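
To make the "ingestion pipeline" gap concrete, here is a rough sketch of the glue you typically end up writing yourself: take the HTML a crawler fetched, pull out the title and visible text, and shape it into a JSON document for Solr. This is only an illustration under assumptions, not how any of the tools above do it. The field names `id`, `title` and `content` are hypothetical and would have to match your own schema, and the helper names are made up for this example.

```python
import json
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the <title> and the visible text of an HTML page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.chunks = []
        self._in_title = False
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def to_solr_doc(url, html):
    """Build one document dict; the field names are assumptions about the schema."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return {
        "id": url,
        "title": parser.title.strip(),
        "content": " ".join(parser.chunks),
    }


def to_update_payload(docs):
    """Serialize a list of document dicts as a JSON array."""
    return json.dumps(docs)
```

The resulting payload is the kind of JSON array you could POST to a Solr core's update handler with a JSON content type. The actual HTTP call, the core name and commit handling are left out here because they depend on your setup.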

answered Sep 24 '22 00:09 by Geordie


I've tried Nutch, but it was very difficult to integrate with Solr. I would take a look at Heritrix. It has an extensive plugin system that makes it easy to integrate with Solr, and it is much faster at crawling because it makes extensive use of threads.
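
For anyone wondering why threads help so much here: crawling is mostly waiting on the network, so fetching many URLs at once hides that latency. The following is a minimal sketch of the general idea using a thread pool. It is not Heritrix's implementation or API, and the names `fetch`, `crawl`, `fetch_fn` and `workers` are made up for this example.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch(url, timeout=10):
    """Download one URL and return its body as text (needs network access)."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def crawl(urls, fetch_fn=fetch, workers=8):
    """Fetch every URL on a thread pool.

    Returns a dict mapping each URL to its body, or to the exception
    that was raised while fetching it.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Submit everything up front so the downloads overlap.
        futures = {url: pool.submit(fetch_fn, url) for url in urls}
        for url, future in futures.items():
            try:
                results[url] = future.result()
            except Exception as exc:  # keep going if one page fails
                results[url] = exc
    return results
```

Passing the fetch function in as a parameter keeps the sketch easy to try without a network, and a real crawler would also need politeness delays, robots.txt handling and link extraction, none of which are shown.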

answered Sep 24 '22 00:09 by John