Apache Nutch: Get outlink URL's text context

Question

Does anyone know an efficient way to extract the text context that surrounds an outlink URL? For example, given this sample text containing an outlink:

Nutch can run on a single machine, but gains a lot of its strength from running in a Hadoop cluster. You can download Nutch here. For more information about Apache Nutch, please see the Nutch wiki.

In this example, I would like to get the sentence containing the link, plus the sentence before and the sentence after it. Is there a way to do this efficiently? Are there methods I can invoke to get something like the position of the link within the fetched content? Or even a part of the Nutch code I can modify to do this? Thanks!

Avanz · Accepted Answer

What you want to do is web scraping. Python and Hadoop both offer tools for that. To achieve it, you can use selectors (for example, XPath or CSS expressions that pick out parts of an HTML page).

Here are some examples of how to do that using Python Scrapy:

Selectors
Scrapy Tutorial
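
To make the idea concrete for the case in the question (the sentence containing a link, plus one sentence on either side), here is a minimal sketch. It is an illustration only: it uses Python's standard-library html.parser rather than Scrapy selectors or any Nutch API, the names LinkContextParser and link_contexts are hypothetical, the example.com URLs are placeholders, and the sentence splitting is a naive regular expression that will mis-split on abbreviations.

```python
import re
from html.parser import HTMLParser


class LinkContextParser(HTMLParser):
    """Collect page text and record where each link's anchor text sits in it.

    Hypothetical helper for illustration; it does not skip <script>/<style> text.
    """

    def __init__(self):
        super().__init__()
        self.chunks = []   # text pieces in document order
        self.length = 0    # number of characters collected so far
        self.links = []    # (href, start_offset, end_offset) per link
        self._open = None  # (href, start_offset) of the <a> currently open

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._open = (href, self.length)

    def handle_data(self, data):
        self.chunks.append(data)
        self.length += len(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._open:
            href, start = self._open
            self.links.append((href, start, self.length))
            self._open = None


def link_contexts(html, window=1):
    """Return (href, context) pairs: the link's sentence plus `window` neighbours."""
    parser = LinkContextParser()
    parser.feed(html)
    parser.close()
    text = "".join(parser.chunks)

    # Naive sentence spans: a sentence ends at ., ! or ? followed by whitespace
    # or the end of the text. This is an assumption, not a robust tokenizer.
    pattern = re.compile(r"\S.*?(?:[.!?]+(?=\s|$)|$)", re.S)
    spans = [m.span() for m in pattern.finditer(text)]

    results = []
    for href, start, end in parser.links:
        end = max(end, start + 1)  # treat links with no anchor text as one character wide
        hit = [i for i, (s, e) in enumerate(spans) if s < end and e > start]
        if not hit:
            continue
        lo = max(0, hit[0] - window)
        hi = hit[-1] + window + 1
        joined = " ".join(text[s:e] for s, e in spans[lo:hi])
        results.append((href, " ".join(joined.split())))  # collapse whitespace
    return results


if __name__ == "__main__":
    sample = (
        "<p>Nutch can run on a single machine, but gains a lot of its strength "
        "from running in a Hadoop cluster. You can download Nutch "
        '<a href="http://example.com/download">here</a>. For more information '
        'about Apache Nutch, please see the <a href="http://example.com/wiki">'
        "Nutch wiki</a>.</p>"
    )
    for href, context in link_contexts(sample):
        print(href)
        print("  " + context)
```

The sketch records character offsets while the text is collected, so the link position asked about in the question falls out directly; the same offset bookkeeping could in principle be applied wherever the parsed HTML is available.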

On Hadoop, the best way to go is to implement a crawl using selectors:

Web crawl with Hadoop
HiveQL

Cascading can be used to handle the URLs you specify:

Hadoop and Cascading

After having the data, you can also use R to optimize analysis:

R and Hadoop
Enabling R on Hadoop

If you haven't done anything with Hadoop yet, here is a good starting point. You may also want to have a look at Hue Beeswax, an interactive tool that is very useful for data analysis.

Tags: apache, web-scraping, hadoop, nutch

user3367701
