How to get the html content from nutch

Tags:

nutch

Is there is any way to get the html content of each webpage in nutch while crawling the web page?

asked Feb 25 '11 by ragaa



2 Answers

Yes, you can actually export the content of the crawled segments. It is not straightforward, but it works well for me. First, create a Java project with the following code:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.util.NutchConfiguration;

import java.io.File;
import java.io.FileOutputStream;

public class NutchSegmentOutputParser {

    public static void main(String[] args) {

        if (args.length != 2) {
            System.out.println("usage: NutchSegmentOutputParser <segmentdir> <outputdir>");
            return;
        }

        try {
            // Load the default Nutch/Hadoop configuration and file system.
            Configuration conf = NutchConfiguration.create();
            FileSystem fs = FileSystem.get(conf);

            String segment = args[0];

            File outDir = new File(args[1]);
            if (!outDir.exists()) {
                if (outDir.mkdir()) {
                    System.out.println("Creating output dir " + outDir.getAbsolutePath());
                }
            }

            // The raw fetched content of a segment is stored as a Hadoop
            // SequenceFile under <segment>/content/part-00000/data.
            Path file = new Path(segment, Content.DIR_NAME + "/part-00000/data");
            SequenceFile.Reader reader = new SequenceFile.Reader(fs, file, conf);

            Text key = new Text();           // the URL the page was fetched from
            Content content = new Content(); // the raw page bytes plus metadata

            while (reader.next(key, content)) {
                // Build a flat file name from the URL: strip the scheme and
                // replace slashes so the name is valid on the local file system.
                String filename = key.toString().replaceFirst("http://", "").replaceAll("/", "___").trim();

                File f = new File(outDir.getCanonicalPath() + "/" + filename);
                FileOutputStream fos = new FileOutputStream(f);
                fos.write(content.getContent());
                fos.close();
                System.out.println(f.getAbsolutePath());
            }
            reader.close();
            fs.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
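Note that the code assumes all content sits in a single part-00000 directory; segments written with more than one reducer have several part-* directories. A minimal sketch for handling that case (the loop body would open a SequenceFile.Reader on each data file exactly as above; needs org.apache.hadoop.fs.FileStatus):

// Assumption: segment content may be split across part-00000, part-00001, ...
Path contentDir = new Path(segment, Content.DIR_NAME);
for (FileStatus part : fs.listStatus(contentDir)) {
    Path data = new Path(part.getPath(), "data");
    if (fs.exists(data)) {
        // open a SequenceFile.Reader on 'data' and run the read loop above
    }
}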

I recommend using Maven; add the following dependencies:

    <dependency>
        <groupId>org.apache.nutch</groupId>
        <artifactId>nutch</artifactId>
        <version>1.5.1</version>
    </dependency>

    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>0.23.1</version>
    </dependency>

and build a jar package (e.g. NutchSegmentOutputParser.jar).
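Assuming a standard Maven layout, a plain package build is enough here, since the Nutch jar is put on the classpath at run time; the jar name below depends on your pom's artifactId and version and is only an assumption:

$ mvn clean package
$ cp target/NutchSegmentOutputParser-1.0.jar NutchSegmentOutputParser.jar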

You need Hadoop to be installed on your machine. Then put both jars on Hadoop's classpath and run the class:

$ export HADOOP_CLASSPATH=NutchSegmentOutputParser.jar:~/.m2/repository/org/apache/nutch/nutch/1.5.1/nutch-1.5.1.jar
$ /hadoop-dir/bin/hadoop NutchSegmentOutputParser nutch-crawled-dir/2012xxxxxxxxx/ outdir

where nutch-crawled-dir/2012xxxxxxxxx/ is the segment directory you want to extract content from (it contains the content subdirectory) and outdir is the output directory. The output file names are generated from the URLs; the scheme is stripped and the slashes are replaced by "___".
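For reference, a typical Nutch 1.x segment directory looks roughly like this (only content/part-00000/data is read by the code above; the exact set of subdirectories depends on which crawl steps have run):

nutch-crawled-dir/2012xxxxxxxxx/
    content/
        part-00000/
            data      <- SequenceFile of (URL, Content) pairs
            index
    crawl_fetch/
    crawl_generate/
    crawl_parse/
    parse_data/
    parse_text/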

Hope it helps.

answered Oct 16 '22 by Habi


Try this in an HtmlParseFilter plugin:

// This goes in an HtmlParseFilter implementation; LOG is the plugin's logger.
public ParseResult filter(Content content, ParseResult parseResult,
                          HTMLMetaTags metaTags, DocumentFragment doc)
{
    Parse parse = parseResult.get(content.getUrl());
    LOG.info("parse.getText: " + parse.getText()); // extracted text of the page
    return parseResult;
}

Then check the content in hadoop.log.
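For the filter to run, its plugin has to be listed in the plugin.includes property, typically in conf/nutch-site.xml. A sketch, where my-parse-filter is a placeholder for your plugin's id, not a real plugin name:

<property>
    <name>plugin.includes</name>
    <!-- my-parse-filter is a placeholder for your HtmlParseFilter plugin id -->
    <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|my-parse-filter</value>
</property>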

answered Oct 16 '22 by Haya aziz