
nutch vs solr indexing

Tags:

solr

lucene

nutch

I have recently started working on Nutch and I am trying to understand how it works. As far as I know, Nutch is used to crawl the web, and Solr/Lucene is used to index and search. But the Nutch documentation says that Nutch also does inverted indexing. Does it use Lucene internally for indexing, or does it have some other library for that? If it uses Solr/Lucene for indexing, why is it necessary to configure Solr with Nutch, as the Nutch tutorial says?

Is indexing done by default? That is, when I run this command to start crawling, is indexing happening?

bin/nutch crawl urls -dir crawl -depth 3 -topN 5

Or does indexing happen only in this case? (According to the tutorial: "If you have a Solr core already set up and wish to index to it, you are required to add the -solr parameter to your crawl command", e.g.)

bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 5
asked Jun 01 '12 05:06

CRS


2 Answers

Having a look here might be useful. When you run the first command:

bin/nutch crawl urls -dir crawl -depth 3 -topN 5

you're crawling, which means that Nutch will create its own internal data, composed of:

  • the crawldb
  • the linkdb
  • a set of segments

You can see them in the following directories, which are created when you run the crawl command:

  • crawl/crawldb
  • crawl/linkdb
  • crawl/segments

You can think of that data as some kind of database where nutch stores crawled data. That doesn't have anything to do with an inverted index.
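To confirm that a crawl produced the three directories listed above, a small shell check can help. This is only a sketch: check_crawl_layout is a hypothetical helper (not part of Nutch), and it assumes the crawl was run with -dir crawl, so the data lives under crawl/.

```shell
# check_crawl_layout is a hypothetical helper, not a Nutch command.
# It reports which of the expected Nutch data directories exist under
# the given crawl directory and returns non-zero if any is missing.
check_crawl_layout() {
  crawl_dir="$1"
  missing=0
  for sub in crawldb linkdb segments; do
    if [ -d "$crawl_dir/$sub" ]; then
      echo "found: $crawl_dir/$sub"
    else
      echo "missing: $crawl_dir/$sub"
      missing=1
    fi
  done
  return "$missing"
}

# Usage (after running the crawl command with -dir crawl):
#   check_crawl_layout crawl
```

If all three directories are reported as found, the crawl data is in place, but nothing has been indexed yet.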

After the crawl process you can index your data on a Solr instance. You can crawl and then index running a single command, which is the second command from your question:

bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 5

Otherwise you can run a second command after the crawl command, specific for indexing to Solr, but you have to provide the path of your crawldb, linkdb and segments:

bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/*
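
The three paths passed to solrindex all hang off the directory given to -dir in the crawl command. As a rough sketch, the helper below (build_solrindex_cmd, a hypothetical name, not part of Nutch) only prints the command for a given Solr URL and crawl directory, so you can check the paths before running anything. It assumes the argument order shown in the solrindex command above and defaults the crawl directory to crawl.

```shell
# build_solrindex_cmd is a hypothetical helper, not a Nutch command.
# It prints (does not run) the solrindex invocation for a crawl directory.
#   $1: Solr URL, e.g. http://localhost:8983/solr/
#   $2: crawl directory passed to -dir (assumed default: crawl)
build_solrindex_cmd() {
  solr_url="$1"
  crawl_dir="${2:-crawl}"
  printf 'bin/nutch solrindex %s %s/crawldb -linkdb %s/linkdb %s/segments/*\n' \
    "$solr_url" "$crawl_dir" "$crawl_dir" "$crawl_dir"
}

# Usage:
#   build_solrindex_cmd http://localhost:8983/solr/
#   build_solrindex_cmd http://localhost:8983/solr/ mycrawl
```

Printing the command first keeps the sketch safe to try on a machine without a running Solr instance.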
answered Nov 20 '22 17:11

javanna


You may be getting confused by legacy Nutch versions and the online documentation written for them. Originally, Nutch created its own index and had its own web search interface; using Solr was an option that required extra configuration and fiddling. Starting with 1.3, the indexing and search server parts were stripped out, and it is now assumed that Nutch will be used with Solr.

answered Nov 20 '22 18:11

John Reece