Recommendations for a spidering tool to use with Lucene or Solr? [closed]

2 Answers

In my opinion, this is a pretty significant gap that is holding back wider adoption of Solr. The new DataImportHandler is a good first step for importing structured data, but there is no good document ingestion pipeline for Solr. Nutch does work, but the integration between the Nutch crawler and Solr is somewhat clumsy.
I've tried every open-source crawler I could find, and none of them integrates with Solr out of the box.
Keep an eye on OpenPipeline and Apache Tika.
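
For what it's worth, the Solr side of the hand-off can be fairly small if you are willing to glue it together yourself. Here is a minimal sketch, not a definitive implementation, assuming a Solr version that accepts JSON on its `/update` handler. The core name `crawl`, the field names `id`, `title`, and `content`, and the shape of the `pages` dictionaries are all hypothetical; your schema and crawler output will differ. The sketch only builds the request and does not send it.

```python
import json
import urllib.request


def build_update_request(solr_url, core, pages):
    """Build (but do not send) a POST request that adds crawled pages to Solr.

    Assumptions: `pages` is a list of dicts with "url", "title" and "text"
    keys (hypothetical crawler output), and the target schema defines the
    fields "id", "title" and "content".
    """
    docs = [
        {"id": page["url"], "title": page["title"], "content": page["text"]}
        for page in pages
    ]
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        url=f"{solr_url}/{core}/update?commit=true",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical crawler output for a single page.
pages = [{"url": "http://example.com/", "title": "Example", "text": "Hello"}]
request = build_update_request("http://localhost:8983/solr", "crawl", pages)

# With a Solr instance running, the request could be sent like this:
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```

Using the page URL as the `id` means that re-crawling a page overwrites the earlier document instead of creating a duplicate, provided `id` is the schema's unique key.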

answered Sep 24 '22 00:09

Geordie

I've tried Nutch, but it was very difficult to integrate with Solr. I would take a look at Heritrix. It has an extensive plugin system that makes it easy to integrate with Solr, and it is much faster at crawling. It makes extensive use of threads to speed up the process.

answered Sep 24 '22 00:09

John

Tags:

solr

lucene

web-crawler

BuddyJoe
