What is a good crawler (spider) for HTML and XML documents (local or web-based) that works well in the Lucene / Solr solution space? It could be Java-based, but does not have to be.
A simple way to conceptualize the relationship between Solr and Lucene is that of a car and its engine. You can't drive an engine, but you can drive a car. Similarly, Lucene is a programmatic library which you can't use as-is, whereas Solr is a complete application which you can use out of the box.
Solr is built on top of Lucene to provide a search platform; it is a wrapper around the Lucene index. Put simply, Solr is the car and Lucene is its engine. You mainly need to know how to drive the car (Solr), but it helps to know a few things about the engine (Lucene) in case something goes wrong with it.
Solr is the popular, blazing-fast, open source enterprise search platform built on Apache Lucene™.
Lucene is a full-text search library in Java which makes it easy to add search functionality to an application or website. It does so by adding content to a full-text index.
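To make "adding content to a full-text index" concrete, here is a toy inverted index in Python. It is only an illustration of the general idea, not Lucene's API or its on-disk format; the class name TinyInvertedIndex and the sample documents are made up for this sketch.

```python
from collections import defaultdict
import re


class TinyInvertedIndex:
    """Toy full-text index: maps each term to the ids of documents containing it."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        # Lowercase and split on non-alphanumerics; real analyzers do far more
        # (stemming, stop words, positions, scoring).
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            self.postings[term].add(doc_id)

    def search(self, query):
        # Return ids of documents that contain every query term (AND semantics).
        terms = re.findall(r"[a-z0-9]+", query.lower())
        if not terms:
            return set()
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return result


# Hypothetical sample documents.
index = TinyInvertedIndex()
index.add(1, "Solr is built on Lucene")
index.add(2, "Heritrix is a web crawler")
```

Searching is then a set lookup per term, which is why this structure answers keyword queries quickly even over many documents.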
In my opinion, this is a pretty significant hole that is holding back the widespread adoption of Solr. The new DataImportHandler is a good first step for importing structured data, but there is no good document ingestion pipeline for Solr. Nutch does work, but the integration between the Nutch crawler and Solr is somewhat clumsy.
I've tried every open-source crawler that I can find, and none of them integrates out-of-the-box with Solr.
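When a crawler has no built-in Solr integration, one workaround is to push its output to Solr's JSON update handler yourself. The following is a minimal sketch of that glue, not a recommended pipeline. The host, port, core name "mycore", and the field names id, title, and content are assumptions and must match your own Solr setup and schema.

```python
import json
import urllib.request


def build_solr_update_request(docs, base_url="http://localhost:8983/solr", core="mycore"):
    """Build a POST request that sends a list of documents to Solr's update handler.

    base_url and core are assumed values; adjust them to your installation.
    """
    url = f"{base_url}/{core}/update?commit=true"
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical crawler output; field names are assumptions about the schema.
crawled = [
    {"id": "http://example.com/", "title": "Example", "content": "Example page text"},
]
request = build_solr_update_request(crawled)

# To actually send it (requires a running Solr instance):
# urllib.request.urlopen(request)
```

Committing on every request is convenient for a demo but slow for bulk crawls; batching documents and committing less often is the usual choice.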
Keep an eye on OpenPipeline and Apache Tika.
I've tried Nutch, but it was very difficult to integrate with Solr. I would take a look at Heritrix. It has an extensive plugin system that makes it easy to integrate with Solr, and it is much faster at crawling. It makes extensive use of threads to speed up the process.
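As a rough picture of why threads help a crawler: fetching is mostly waiting on the network, so several fetches can overlap. The sketch below shows that general idea in Python and says nothing about how Heritrix is actually implemented. The function crawl_concurrently, the fake_fetch stand-in, and the URLs are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def crawl_concurrently(urls, fetch, max_workers=8):
    """Call fetch(url) for each URL in a thread pool and return {url: result}."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with urls.
        results = list(pool.map(fetch, urls))
    return dict(zip(urls, results))


# Stand-in for a real HTTP fetch so the sketch runs without network access.
FAKE_PAGES = {
    "http://example.com/a": "<html><title>A</title></html>",
    "http://example.com/b": "<html><title>B</title></html>",
}


def fake_fetch(url):
    return FAKE_PAGES.get(url, "")


pages = crawl_concurrently(list(FAKE_PAGES), fake_fetch)
```

In a real crawler, fetch would perform the HTTP request, and you would also need politeness delays and per-host limits.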