According to the Nutch news page, the latest version of Nutch is 2.3.1, which is compatible with Solr 4.10.3, a very old version of Solr.
Can Solr 6 be integrated with Nutch 2.3.1? What would the drawbacks be? Has anybody tried this?
This is an old question but I just got Nutch 1.12 talking to Solr 6.3.0. The required schema/solrconfig changes should be the same for Nutch 2.x so here's what I did:
Download and extract both products into some directory, e.g. ~/mycrawler, then go into the solr directory and create a core for nutch:
solr-6.3.0/bin $ ./solr start
solr-6.3.0/bin $ ./solr create_core -c nutch -d basic_configs
solr-6.3.0/bin $ ./solr stop
This creates solr-6.3.0/server/solr/nutch, where the schema and the rest of the core configuration live. Next, remove the new auto-managed schema definition and replace it with the Nutch-supplied schema.xml:
solr-6.3.0/server/solr/nutch/conf $ rm managed-schema
solr-6.3.0/server/solr/nutch/conf $ cp ~/mycrawler/apache-nutch-1.12/conf/schema.xml .
Now edit schema.xml and remove all instances of enablePositionIncrements="true" in all <filter class="solr.StopFilterFactory" ignoreCase="true" ... definitions.
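If you would rather not edit the file by hand, the removal can be scripted. This is only a minimal sketch: the function name strip_position_increments is made up for illustration, and it assumes the attribute is always written exactly as enablePositionIncrements="true". It keeps a .bak copy of the original next to the file.

```shell
#!/bin/sh
# Hypothetical helper (not part of Nutch or Solr): remove every
# enablePositionIncrements="true" attribute from the given schema file.
strip_position_increments() {
    schema="$1"
    # Keep a backup, then write the cleaned copy back to the original path.
    # The pattern also removes the whitespace in front of the attribute.
    cp "$schema" "$schema.bak" &&
    sed 's/[[:space:]]*enablePositionIncrements="true"//g' "$schema.bak" > "$schema"
}

# Usage, run from the directory that contains solr-6.3.0:
#   strip_position_increments solr-6.3.0/server/solr/nutch/conf/schema.xml
```

Afterwards, grep enablePositionIncrements schema.xml should print nothing.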
Also, in solr-6.3.0/server/solr/nutch/conf/solrconfig.xml, comment out the typeMapping blocks so that you get:
<processor class="solr.AddSchemaFieldsUpdateProcessorFactory">
  <str name="defaultFieldType">strings</str>
  <!--
  <lst name="typeMapping">
    <str name="valueClass">java.lang.Boolean</str>
    <str name="fieldType">booleans</str>
  </lst>
  <lst name="typeMapping">
    <str name="valueClass">java.util.Date</str>
    <str name="fieldType">tdates</str>
  </lst>
  <lst name="typeMapping">
    <str name="valueClass">java.lang.Long</str>
    <str name="valueClass">java.lang.Integer</str>
    <str name="fieldType">tlongs</str>
  </lst>
  <lst name="typeMapping">
    <str name="valueClass">java.lang.Number</str>
    <str name="fieldType">tdoubles</str>
  </lst>
  -->
</processor>
Now start the server again:
solr-6.3.0/bin $ ./solr start
If you open the admin GUI, it should show the core as started, with no further schema issues.
Now the crawl script can be run and will successfully write into our bleeding edge Solr (this is probably slightly different for Nutch 2):
./crawl -i \
-D solr.server.url=http://localhost:8983/solr/nutch \
~/mycrawler/nutch_work/seed \
~/mycrawler/nutch_work/crawl \
1
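To check that the crawl really wrote something into the core, one option is to ask Solr how many documents it holds. A rough sketch, assuming the same core URL as in the crawl command and the standard /select handler; the helper names solr_count_url and solr_count are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helpers; the base URL is the same value passed as
# solr.server.url to the crawl script.

# Print a /select URL that matches every document but returns no rows,
# so the response only carries the total hit count (numFound).
solr_count_url() {
    printf '%s/select?q=*:*&rows=0&wt=json\n' "$1"
}

# Fetch that URL. Needs curl and a running Solr instance.
solr_count() {
    curl -s "$(solr_count_url "$1")"
}

# Usage (needs the server to be up, so it is not run here):
#   solr_count http://localhost:8983/solr/nutch
```

A numFound value greater than zero in the response means the indexing step reached the core.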