 

Using nutch in Windows 7

I am trying to use Nutch 1.6 from the Windows environment, but every time I try to run it as per the procedure given in the Apache Nutch tutorial, I end up with the following exception:

Exception in thread "main" java.io.IOException: Failed to set permissions of path: \tmp\hadoop-ajayn\mapred\staging\ajayn-1231695575\.staging to 0700

I have been searching extensively over the net, but there is no concrete solution. Please note that I have no Hadoop instances installed or running on the system, and my sole purpose is to try out Nutch as a web crawling agent.

Is it even possible to run Nutch 1.6 on Windows, and if yes, any pointers as to how to go about it and avoid the above exception?

PS: if it helps, the /tmp/ folder has a Read Only attribute attached to it that does not change even if you try to change it. Also, from cygwin I have tried to set the file permissions to 777, but every time I run the Nutch instance, a new folder (e.g. "ajayn-1231695575") is created that does not have any execution rights.

Thanks

Ajay

Ajay Nair asked Dec 24 '12



1 Answer

Did you try GettingNutchRunningWithWindows from the Nutch Wiki?

Some of my students experimented a lot and here is the result of their work:

Tested with Nutch 1.7: http://www.apache.org/dyn/closer.cgi/nutch/1.7/apache-nutch-1.7-bin.zip. You'll also need cygwin.

1) Extract Nutch to a path without spaces. For example:

 d:\dev\ir\nutch-1.7

2) Copy the JDK to some place without spaces. I attempted to make a symlink inside cygwin instead, but it did not go well. For example:

xcopy /S "C:\Program Files\Java\jdk1.7.0_21" c:\jdk1.7.0_21
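The "no spaces in the path" rule in steps 1 and 2 is easy to check mechanically. As a minimal sketch (the two paths below are just the examples from this answer), a small shell helper can flag offending paths before you run anything:

```shell
#!/bin/sh
# Guard against the "spaces in path" problem from steps 1-2.
# has_spaces returns 0 (true) if the given path contains a space.
has_spaces() {
    case "$1" in
        *" "*) return 0 ;;
        *)     return 1 ;;
    esac
}

# Check the example paths from this setup.
for p in "C:/Program Files/Java/jdk1.7.0_21" "c:/jdk1.7.0_21"; do
    if has_spaces "$p"; then
        echo "BAD: $p"
    else
        echo "OK:  $p"
    fi
done
```

Running this prints `BAD:` for the original `Program Files` location and `OK:` for the copied one, which is exactly why the `xcopy` above is needed.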

3) In cygwin, set up the paths to Java:

3.1) export JAVA_HOME=/cygdrive/c/jdk1.7.0_21

3.2) export PATH=$JAVA_HOME/bin:$PATH

3.3) Check that everything is correct by calling which java. It should return /cygdrive/c/jdk1.7.0_21/bin/java

SO FAR - this fixes the first problem, the incorrect Java paths. Now to the second problem: Hadoop patching.

4) Patch hadoop

https://issues.apache.org/jira/browse/HADOOP-7682
https://github.com/congainc/patch-hadoop_7682-1.0.x-win

In short: put patch-hadoop_7682-1.0.x-win.jar in d:\dev\ir\nutch-1.7\lib, then edit d:\dev\ir\nutch-1.7\conf\nutch-site.xml by adding the following:

<property>
   <name>fs.file.impl</name>
   <value>com.conga.services.hadoop.patch.HADOOP_7682.WinLocalFileSystem</value>
   <description>Enables patch for issue HADOOP-7682 on Windows</description>
</property>

5) Hadoop temp dir - I am not sure if this is necessary (try without it first), because I added it before applying the patch, but in my d:\dev\ir\nutch-1.7\conf\nutch-site.xml I have

<property>
   <name>hadoop.tmp.dir</name>
   <value>C:\tmp\asd</value>
</property>
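For reference, a complete minimal nutch-site.xml combining steps 4 and 5 might look like this (the values are the ones from this setup; adjust the temp path to your machine):

```xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.file.impl</name>
    <value>com.conga.services.hadoop.patch.HADOOP_7682.WinLocalFileSystem</value>
    <description>Enables patch for issue HADOOP-7682 on Windows</description>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>C:\tmp\asd</value>
  </property>
</configuration>
```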

6) Hadoop version - I am not sure if this is necessary (try without it first). I downgraded Hadoop to hadoop-core-0.20.205.0.jar before I found the patch, and it still stays this way on my setup. If you find it necessary, it is here: http://mvnrepository.com/artifact/org.apache.hadoop/hadoop-core/0.20.205.0

6.1) Move hadoop-core-1.2.1.jar from d:\dev\ir\nutch-1.7\lib to some location for backup

6.2) Download hadoop-core-0.20.205.0.jar to d:\dev\ir\nutch-1.7\lib
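Steps 6.1 and 6.2 can be sketched as shell commands. The sketch below uses a scratch directory with a stand-in jar so it can be dry-run safely; for the real swap, point NUTCH_LIB at d:/dev/ir/nutch-1.7/lib and do the download by hand from the mvnrepository page above:

```shell
#!/bin/sh
# Scratch-directory dry run of the hadoop-core jar swap (steps 6.1/6.2).
NUTCH_LIB="${NUTCH_LIB:-./nutch-lib-demo}"
BACKUP="$NUTCH_LIB/backup"
mkdir -p "$NUTCH_LIB" "$BACKUP"
touch "$NUTCH_LIB/hadoop-core-1.2.1.jar"   # stand-in for the real jar

# 6.1) move the newer hadoop-core jar out of the way
mv "$NUTCH_LIB/hadoop-core-1.2.1.jar" "$BACKUP/"

# 6.2) download the older jar into lib (left commented: fetch the direct
# jar URL from the mvnrepository page linked above)
# curl -L -o "$NUTCH_LIB/hadoop-core-0.20.205.0.jar" "<direct jar URL>"
```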

7) Some crawling optimizations. If you need to crawl lots of sites, don't start crawling with a huge list of URLs and a big depth and topN. If you do, you'll see that Nutch fetches links from the same site one at a time, sequentially, waiting 5 seconds between fetches. The reason is that a depth of 30 and a topN of 200 will most likely fill the first fetch queue with links from a single site only. Nutch won't fetch them at once, because by default it is configured not to fetch from the same site in several threads. So you are doomed to wait. A lot.

7.1) To work around this, first run several crawls with a small depth and topN, e.g.

bin/nutch crawl urls -dir crawl -depth 3 -topN 4

This will fill the generated fetch queue with URLs from more than one site.
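The "several small crawls" idea can be scripted. A minimal sketch follows; the bin/nutch call is left commented so the loop itself can be dry-run, and ROUNDS is an assumed knob of this script, not a Nutch option:

```shell
#!/bin/sh
# Run ROUNDS shallow crawl passes to seed the fetch queues with URLs
# from many different sites before attempting a deep crawl (step 7.2).
ROUNDS="${ROUNDS:-3}"
i=1
while [ "$i" -le "$ROUNDS" ]; do
    echo "shallow crawl round $i"
    # bin/nutch crawl urls -dir crawl -depth 3 -topN 4
    i=$((i + 1))
done
```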

7.2) Then you can try a big overnight crawl with:

bin/nutch crawl urls -dir crawl -depth 20  -topN 150

7.3) To allow for some multi-threading, add the following to your nutch-site.xml. It will allow several threads to fetch from the same host at once.

NOTE! Read up on the meaning of these properties before using them.

<property>
  <name>fetcher.threads.fetch</name>
  <value>16</value>
</property>
<property>
  <name>fetcher.threads.per.queue</name>
  <value>4</value>
</property>
<property>
  <name>fetcher.queue.mode</name>
  <value>byDomain</value>
</property>
<property>
  <name>fetcher.threads.per.host</name>
  <value>8</value>
</property>
<property>
  <name>fetcher.verbose</name>
  <value>true</value>
</property>
<property>
  <name>fetcher.server.min.delay</name>
  <value>5.0</value>
  <description>applicable ONLY if fetcher.threads.per.host is greater than 1 (i.e. the host blocking is turned off).</description>
</property>

Note: When you crawl lots of sites, make sure your D:\Dev\id\apache-nutch-1.7\conf\regex-urlfilter.txt includes only the sites in which you are interested. Otherwise you'll end up with "The Internet" on your disk.
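As an illustration of such a filter (example.com is a placeholder domain, and the first and last lines mirror the conventions of Nutch's default regex-urlfilter.txt), a file that keeps the crawl on one site could look like:

```
# skip URLs containing characters that usually mark queries/sessions
-[?*!@=]
# accept anything under example.com
+^https?://([a-z0-9-]+\.)*example\.com/
# reject everything else
-.
```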

Mateva answered Sep 20 '22