
Solr 6 and Nutch 2.3.1 integration

Tags:

solr

nutch

According to the Nutch news page, the latest version of Nutch is 2.3.1, which is compatible with Solr 4.10.3, a very old version of Solr.

Can Solr 6 be integrated with Nutch 2.3.1? What are the drawbacks of doing so? Has anybody tried this?

MTA asked Jan 05 '23 16:01


1 Answer

This is an old question, but I just got Nutch 1.12 talking to Solr 6.3.0. The required schema/solrconfig changes should be the same for Nutch 2.x, so here's what I did:

Download and extract both products into some directory, e.g. ~/mycrawler, then go into the Solr bin directory and create a core for Nutch:

solr-6.3.0/bin $ ./solr start
solr-6.3.0/bin $ ./solr create_core -c nutch -d basic_configs
solr-6.3.0/bin $ ./solr stop

This will create solr-6.3.0/server/solr/nutch, where the schema and the rest of the core's configuration are located. Now we need to remove the new auto-managed schema definition and replace it with the Nutch-supplied schema.xml:

solr-6.3.0/server/solr/nutch/conf $ rm managed-schema
solr-6.3.0/server/solr/nutch/conf $ cp ~/mycrawler/apache-nutch-1.12/conf/schema.xml .

Now edit schema.xml and remove every enablePositionIncrements="true" attribute from the <filter class="solr.StopFilterFactory" ignoreCase="true" ... definitions.
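If you would rather not edit the file by hand, the removal can be scripted. The following is only a sketch: strip_position_increments is a hypothetical helper name, and it assumes a sed that accepts the -i.bak form (GNU and BSD sed both do). It keeps a .bak copy of the original file.

```shell
# Hypothetical helper: remove every enablePositionIncrements="true"
# attribute (and the whitespace in front of it) from the given file.
# A backup of the original is kept next to it with a .bak suffix.
strip_position_increments() {
  sed -i.bak 's/[[:space:]]*enablePositionIncrements="true"//g' "$1"
}

# Usage, from solr-6.3.0/server/solr/nutch/conf:
#   strip_position_increments schema.xml
#   grep -n enablePositionIncrements schema.xml   # should print nothing
```

Any other value of the attribute is left untouched, so check the grep output before restarting Solr.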

Also, in solr-6.3.0/server/solr/nutch/conf/solrconfig.xml, comment out the typeMapping blocks, so you get:

<processor class="solr.AddSchemaFieldsUpdateProcessorFactory">
  <str name="defaultFieldType">strings</str>
    <!--
  <lst name="typeMapping">
    <str name="valueClass">java.lang.Boolean</str>
    <str name="fieldType">booleans</str>
  </lst>
  <lst name="typeMapping">
    <str name="valueClass">java.util.Date</str>
    <str name="fieldType">tdates</str>
  </lst>
  <lst name="typeMapping">
    <str name="valueClass">java.lang.Long</str>
    <str name="valueClass">java.lang.Integer</str>
    <str name="fieldType">tlongs</str>
  </lst>
  <lst name="typeMapping">
    <str name="valueClass">java.lang.Number</str>
    <str name="fieldType">tdoubles</str>
  </lst>
    -->
</processor>

Now start the server again:

solr-6.3.0/bin $ ./solr start

If you go to the admin GUI, it should show the core as started with no further schema issues.
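The same check can be made from the command line through Solr's CoreAdmin STATUS action. This is a sketch under assumptions: core_status_url is a hypothetical helper name, Solr is assumed to listen on localhost:8983 as in the rest of this answer, and curl is assumed to be installed.

```shell
# Hypothetical helper: build the CoreAdmin STATUS URL for a core.
# $1 = core name, $2 = Solr base URL (optional, defaults to the local server)
core_status_url() {
  printf '%s/admin/cores?action=STATUS&core=%s&wt=json\n' \
    "${2:-http://localhost:8983/solr}" "$1"
}

# Usage, with the server running:
#   curl -s "$(core_status_url nutch)"
```

If the core failed to load, the JSON response should list it under initFailures instead of reporting its status.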

Now the crawl script can be run and will successfully write into our bleeding-edge Solr (this is probably slightly different for Nutch 2):

./crawl -i \
    -D solr.server.url=http://localhost:8983/solr/nutch \
    ~/mycrawler/nutch_work/seed \
    ~/mycrawler/nutch_work/crawl \
    1
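
To confirm that the crawl actually indexed something, you can ask the core how many documents it holds. Again only a sketch: select_count_url is a hypothetical helper name, and it assumes the core URL from the crawl command above and the default /select handler.

```shell
# Hypothetical helper: build a match-all query URL that returns no rows,
# only the total hit count.
# $1 = core URL (optional, defaults to the nutch core used above)
select_count_url() {
  printf '%s/select?q=*:*&rows=0&wt=json\n' \
    "${1:-http://localhost:8983/solr/nutch}"
}

# Usage, after the crawl has finished:
#   curl -s "$(select_count_url)"
```

The numFound value in the response is the number of indexed documents.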
C. Ramseyer answered Jan 16 '23 15:01