I am trying to use Nutch 1.6 from the Windows environment, but every time I run it as per the procedure given in the Apache Nutch tutorial, I end up with the following exception:
Exception in thread "main" java.io.IOException: Failed to set permissions of path: \tmp\hadoop-ajayn\mapred\staging\ajayn-1231695575\.staging to 0700
I have been searching extensively over the net, but there is no concrete solution. Please note that I have no Hadoop instances installed or running on the system; my sole purpose is to try out Nutch as a web crawling agent.
Is it even possible to run Nutch 1.6 on Windows, and if so, are there any pointers on how to go about it and avoid the above exception?
PS: if it helps, the /tmp/ folder has a Read Only attribute attached to it that does not change even if I try to change it. Also, from cygwin I have tried to set the file permissions to 777, but every time I run the Nutch instance, a new folder (e.g. "ajayn-1231695575") is created that does not have any execution rights.
Thanks
Ajay
Did you try GettingNutchRunningWithWindows from the Nutch Wiki?
Some of my students experimented a lot and here is the result of their work:
Tested with nutch 1.7 - http://www.apache.org/dyn/closer.cgi/nutch/1.7/apache-nutch-1.7-bin.zip
You'll also need cygwin.
1) Extract nutch to path without spaces. For example:
d:\dev\ir\nutch-1.7
2) Copy the JDK to some place without spaces. I attempted to make a symlink inside cygwin instead, but it did not go well. For example:
xcopy /S "C:\Program Files\Java\jdk1.7.0_21" c:\jdk1.7.0_21
3) In cygwin, set up the paths to Java:
3.1) export JAVA_HOME=/cygdrive/c/jdk1.7.0_21
3.2) export PATH=$JAVA_HOME/bin:$PATH
3.3) Check that everything is correct by running which java. It should return /cygdrive/c/jdk1.7.0_21/bin/java
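Steps 3.1-3.3 can be collected in one place; appending these lines to ~/.bashrc in cygwin makes them survive new terminal sessions. The path assumes the JDK copy from step 2.

```shell
# Set JAVA_HOME to the space-free JDK copy from step 2 and put its bin dir
# first on PATH so cygwin picks up the right java.
export JAVA_HOME=/cygdrive/c/jdk1.7.0_21
export PATH="$JAVA_HOME/bin:$PATH"
# which java    # should then print /cygdrive/c/jdk1.7.0_21/bin/java
```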
So far this fixes the first problem, the incorrect Java paths. Now for the second problem: Hadoop patching.
4) Patch hadoop
https://issues.apache.org/jira/browse/HADOOP-7682
https://github.com/congainc/patch-hadoop_7682-1.0.x-win
In short:
- put patch-hadoop_7682-1.0.x-win.jar
in d:\dev\ir\nutch-1.7\lib
- edit d:\dev\ir\nutch-1.7\conf\nutch-site.xml
by adding the following:
<property>
<name>fs.file.impl</name>
<value>com.conga.services.hadoop.patch.HADOOP_7682.WinLocalFileSystem</value>
<description>Enables patch for issue HADOOP-7682 on Windows</description>
</property>
5) Hadoop temp dir. I am not sure whether this is necessary (try without it first), because I added it before applying the patch, but in my d:\dev\ir\nutch-1.7\conf\nutch-site.xml I have:
<property>
<name>hadoop.tmp.dir</name>
<value>C:\tmp\asd</value>
</property>
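For reference, a minimal conf/nutch-site.xml combining this temp dir with the HADOOP-7682 patch property from step 4 might look like the sketch below; adjust the hadoop.tmp.dir value to a writable folder on your machine.

```xml
<?xml version="1.0"?>
<configuration>
<property>
<name>fs.file.impl</name>
<value>com.conga.services.hadoop.patch.HADOOP_7682.WinLocalFileSystem</value>
<description>Enables patch for issue HADOOP-7682 on Windows</description>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>C:\tmp\asd</value>
</property>
</configuration>
```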
6) Hadoop version. I am not sure whether this is necessary either (try without it first); I downgraded Hadoop to hadoop-core-0.20.205.0.jar before I found the patch, and it has stayed that way in my setup.
If you find this necessary it is here: http://mvnrepository.com/artifact/org.apache.hadoop/hadoop-core/0.20.205.0
6.1) Move hadoop-core-1.2.1.jar
from d:\dev\ir\nutch-1.7\lib
to some location for backup
6.2) Download hadoop-core-0.20.205.0.jar
to d:\dev\ir\nutch-1.7\lib
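The jar swap in 6.1/6.2 can be scripted roughly as below. To keep the sketch safe to run anywhere, it uses a scratch directory with a dummy jar; point NUTCH_LIB at your real d:\dev\ir\nutch-1.7\lib and uncomment the download line instead (the URL is an assumption following the standard Maven Central layout for org.apache.hadoop:hadoop-core:0.20.205.0).

```shell
# Stand-in for d:/dev/ir/nutch-1.7/lib -- replace with your real lib dir.
NUTCH_LIB="$(mktemp -d)"
touch "$NUTCH_LIB/hadoop-core-1.2.1.jar"   # stand-in for the bundled jar

# 6.1) Move hadoop-core-1.2.1.jar out of lib to a backup location.
BACKUP="$NUTCH_LIB/backup"
mkdir -p "$BACKUP"
mv "$NUTCH_LIB/hadoop-core-1.2.1.jar" "$BACKUP/"

# 6.2) Download hadoop-core-0.20.205.0.jar into lib (uncomment to run for real):
# curl -L -o "$NUTCH_LIB/hadoop-core-0.20.205.0.jar" \
#   "https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-core/0.20.205.0/hadoop-core-0.20.205.0.jar"

ls "$BACKUP"
```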
7) Some crawling optimizations. If you need to crawl lots of sites, don't start crawling with a huge list of URLs and a big depth and topN. If you do, you'll see that Nutch fetches links from the same site one at a time, sequentially, waiting 5 seconds between fetches. The reason is that a depth of 30 and a topN of 200 will most likely fill the first fetch queue with links from a single site. Nutch won't fetch them at once, because by default it is configured not to fetch from the same site in several threads. So you are doomed to wait. A lot.
7.1) To resolve this, first run several crawls with a small depth and topN, e.g.
bin/nutch crawl urls -dir crawl -depth 3 -topN 4
This will fill the generated fetch queue with URLs from more than one site.
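The "several small crawls" can be scripted as a simple loop. This sketch only echoes the commands as a dry run; remove the echo to actually crawl, running from the Nutch install directory.

```shell
# Step 7.1 as a loop: a few shallow crawl rounds to diversify the fetch
# queues before the big crawl. The echo makes this a dry run.
for round in 1 2 3; do
  echo "round $round: bin/nutch crawl urls -dir crawl -depth 3 -topN 4"
done
```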
7.2) Then you can try a big night's crawl with
bin/nutch crawl urls -dir crawl -depth 20 -topN 150
7.3) To allow some multi-threading, add the following to your nutch-site.xml. It will allow several threads to fetch from the same host at once.
NOTE! Read up on the meaning of these properties before using them.
<property>
<name>fetcher.threads.fetch</name>
<value>16</value>
</property>
<property>
<name>fetcher.threads.per.queue</name>
<value>4</value>
</property>
<property>
<name>fetcher.queue.mode</name>
<value>byDomain</value>
</property>
<property>
<name>fetcher.threads.per.host</name>
<value>8</value>
</property>
<property>
<name>fetcher.verbose</name>
<value>true</value>
</property>
<property>
<name>fetcher.server.min.delay</name>
<value>5.0</value>
<description>applicable ONLY if fetcher.threads.per.host is greater than 1 (i.e. the host blocking is turned off).</description>
</property>
Note: When you crawl lots of sites, make sure your D:\Dev\id\apache-nutch-1.7\conf\regex-urlfilter.txt
includes only the sites in which you are interested. Otherwise you'll end up with "The Internet" on your disk.
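For example, a restrictive regex-urlfilter.txt might look like the sketch below (example.com and example.org are placeholders for your own sites). Rules are applied in order, so the final "-." rejects every URL not accepted above it.

```
# accept only the hosts you care about
+^https?://([a-z0-9-]+\.)*example\.com/
+^https?://([a-z0-9-]+\.)*example\.org/
# reject everything else
-.
```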