I am trying to use Nutch 1.6 from the Windows environment, but every time I run it as per the procedure given in the Apache Nutch tutorial, I end up with the following exception:
Exception in thread "main" java.io.IOException: Failed to set permissions of path: \tmp\hadoop-ajayn\mapred\staging\ajayn-1231695575\.staging to 0700
I have been searching extensively over the net, but there is no concrete solution. Please note that I have no Hadoop instances installed or running on the system; my sole purpose is to try out Nutch as a web crawling agent.
Is it even possible to run Nutch 1.6 on Windows, and if so, are there any pointers on how to go about it and avoid the above exception?
PS: if it helps, the /tmp/ folder has a Read Only attribute attached to it that does not change even if I try to change it. Also, from cygwin I have tried to set the file permissions to 777, but every time I run the Nutch instance, a new folder (e.g. "ajayn-1231695575") is created that does not have any execution rights.
Thanks
Ajay
Did you try GettingNutchRunningWithWindows from the Nutch Wiki?
Some of my students experimented a lot and here is the result of their work:
Tested with nutch 1.7 - http://www.apache.org/dyn/closer.cgi/nutch/1.7/apache-nutch-1.7-bin.zip
You'll also need cygwin.
1) Extract nutch to path without spaces. For example:
d:\dev\ir\nutch-1.7
2) Copy the JDK to some place without spaces. I attempted to make a symlink inside cygwin instead, but it did not go well. For example:
xcopy /S "C:\Program Files\Java\jdk1.7.0_21" c:\jdk1.7.0_21
3) In cygwin, set up the paths to Java:
3.1) export JAVA_HOME=/cygdrive/c/jdk1.7.0_21
3.2) export PATH=$JAVA_HOME/bin:$PATH
3.3) Check that everything is correct by running which java. It should return /cygdrive/c/jdk1.7.0_21/bin/java
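Steps 3.1-3.3 can be collected in one place; appending these lines to ~/.bashrc in cygwin makes them survive new terminal sessions. The path assumes the JDK copy from step 2.

```shell
# Set JAVA_HOME to the space-free JDK copy from step 2 and put its bin dir
# first on PATH so cygwin picks up the right java.
export JAVA_HOME=/cygdrive/c/jdk1.7.0_21
export PATH="$JAVA_HOME/bin:$PATH"
# which java    # should then print /cygdrive/c/jdk1.7.0_21/bin/java
```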
So far this fixes the first problem, the incorrect Java paths. Now for the second problem: Hadoop patching.
4) Patch hadoop
https://issues.apache.org/jira/browse/HADOOP-7682
https://github.com/congainc/patch-hadoop_7682-1.0.x-win
In short:
- put patch-hadoop_7682-1.0.x-win.jar
in d:\dev\ir\nutch-1.7\lib
- edit d:\dev\ir\nutch-1.7\conf\nutch-site.xml
by adding the following:
<property>
<name>fs.file.impl</name>
<value>com.conga.services.hadoop.patch.HADOOP_7682.WinLocalFileSystem</value>
<description>Enables patch for issue HADOOP-7682 on Windows</description>
</property>
5) Hadoop temp dir. I am not sure whether this is necessary (try without it first), because I added it before applying the patch, but in my d:\dev\ir\nutch-1.7\conf\nutch-site.xml I have:
<property>
<name>hadoop.tmp.dir</name>
<value>C:\tmp\asd</value>
</property>
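For reference, a minimal conf/nutch-site.xml combining this temp dir with the HADOOP-7682 patch property from step 4 might look like the sketch below; adjust the hadoop.tmp.dir value to a writable folder on your machine.

```xml
<?xml version="1.0"?>
<configuration>
<property>
<name>fs.file.impl</name>
<value>com.conga.services.hadoop.patch.HADOOP_7682.WinLocalFileSystem</value>
<description>Enables patch for issue HADOOP-7682 on Windows</description>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>C:\tmp\asd</value>
</property>
</configuration>
```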
6) Hadoop version. I am not sure whether this is necessary either (try without it first); I downgraded Hadoop to hadoop-core-0.20.205.0.jar before I found the patch, and it has stayed that way in my setup.
If you find this necessary it is here: http://mvnrepository.com/artifact/org.apache.hadoop/hadoop-core/0.20.205.0
6.1) Move hadoop-core-1.2.1.jar
from d:\dev\ir\nutch-1.7\lib
to some location for backup
6.2) Download hadoop-core-0.20.205.0.jar
to d:\dev\ir\nutch-1.7\lib
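The jar swap in 6.1/6.2 can be scripted roughly as below. To keep the sketch safe to run anywhere, it uses a scratch directory with a dummy jar; point NUTCH_LIB at your real d:\dev\ir\nutch-1.7\lib and uncomment the download line instead (the URL is an assumption following the standard Maven Central layout for org.apache.hadoop:hadoop-core:0.20.205.0).

```shell
# Stand-in for d:/dev/ir/nutch-1.7/lib -- replace with your real lib dir.
NUTCH_LIB="$(mktemp -d)"
touch "$NUTCH_LIB/hadoop-core-1.2.1.jar"   # stand-in for the bundled jar

# 6.1) Move hadoop-core-1.2.1.jar out of lib to a backup location.
BACKUP="$NUTCH_LIB/backup"
mkdir -p "$BACKUP"
mv "$NUTCH_LIB/hadoop-core-1.2.1.jar" "$BACKUP/"

# 6.2) Download hadoop-core-0.20.205.0.jar into lib (uncomment to run for real):
# curl -L -o "$NUTCH_LIB/hadoop-core-0.20.205.0.jar" \
#   "https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-core/0.20.205.0/hadoop-core-0.20.205.0.jar"

ls "$BACKUP"
```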
7) Some crawling optimizations. If you need to crawl lots of sites, don't start crawling with a huge list of URLs and a big depth and topN. If you do, you'll see that Nutch fetches links from the same site one at a time, sequentially, waiting 5 seconds between fetches. The reason is that a depth of 30 and a topN of 200 will most likely fill the first fetch queue with links from a single site. Nutch won't fetch them at once, because by default it is configured not to fetch from the same site in several threads. So you are doomed to wait. A lot.
7.1) To resolve this, first run several crawls with a small depth and topN, e.g.
bin/nutch crawl urls -dir crawl -depth 3 -topN 4
This will fill the generated fetch queue with URLs from more than one site.
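The "several small crawls" can be scripted as a simple loop. This sketch only echoes the commands as a dry run; remove the echo to actually crawl, running from the Nutch install directory.

```shell
# Step 7.1 as a loop: a few shallow crawl rounds to diversify the fetch
# queues before the big crawl. The echo makes this a dry run.
for round in 1 2 3; do
  echo "round $round: bin/nutch crawl urls -dir crawl -depth 3 -topN 4"
done
```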
7.2) Then you can try a big night's crawl with
bin/nutch crawl urls -dir crawl -depth 20 -topN 150
7.3) To allow some multi-threading, add the following to your nutch-site.xml. It will allow several threads to fetch from the same host at once.
NOTE! Read up on the meaning of these properties before using them.
<property>
<name>fetcher.threads.fetch</name>
<value>16</value>
</property>
<property>
<name>fetcher.threads.per.queue</name>
<value>4</value>
</property>
<property>
<name>fetcher.queue.mode</name>
<value>byDomain</value>
</property>
<property>
<name>fetcher.threads.per.host</name>
<value>8</value>
</property>
<property>
<name>fetcher.verbose</name>
<value>true</value>
</property>
<property>
<name>fetcher.server.min.delay</name>
<value>5.0</value>
<description>applicable ONLY if fetcher.threads.per.host is greater than 1 (i.e. the host blocking is turned off).</description>
</property>
Note: When you crawl lots of sites, make sure your D:\Dev\id\apache-nutch-1.7\conf\regex-urlfilter.txt
includes only the sites in which you are interested. Otherwise you'll end up with "The Internet" on your disk.
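For example, a restrictive regex-urlfilter.txt might look like the sketch below (example.com and example.org are placeholders for your own sites). Rules are applied in order, so the final "-." rejects every URL not accepted above it.

```
# accept only the hosts you care about
+^https?://([a-z0-9-]+\.)*example\.com/
+^https?://([a-z0-9-]+\.)*example\.org/
# reject everything else
-.
```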