
Apache Nutch: Get outlink URL's text context

Does anyone know an efficient way to extract the text context that surrounds an outlink URL? For example, given this sample text containing outlinks:

Nutch can run on a single machine, but gains a lot of its strength from running in a Hadoop cluster. You can download Nutch here. For more information about Apache Nutch, please see the Nutch wiki.

In this example, I would like to get the sentence containing the link, plus the sentence before and the sentence after it. Is there an efficient way to do this? Is there a method I can invoke to get something like the position of the link within the fetched content, or a part of the Nutch code I can modify to do this? Thanks!

user3367701 asked Mar 09 '14 14:03



1 Answer

What you want to do is web scraping. Python and Hadoop both offer tools for that. To achieve it, you can use selectors.

Here are some examples of how to do that using Python's Scrapy:

  • Selectors
  • Scrapy Tutorial
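
To make the idea concrete, here is a minimal sketch of getting the sentence around each link, plus one sentence on either side. It is not Scrapy or Nutch code: to keep it dependency-free it uses Python's standard `html.parser` in place of Scrapy selectors, and the sentence splitting is a naive regex. The names `LinkContextParser` and `link_contexts`, and the `example.org` URLs, are made up for illustration.

```python
import re
from html.parser import HTMLParser


class LinkContextParser(HTMLParser):
    """Collects page text and records where each link's anchor text sits in it."""

    def __init__(self):
        super().__init__()
        self.parts = []    # text fragments in document order
        self.length = 0    # number of characters collected so far
        self.links = []    # (href, start, end) offsets into the joined text
        self._open = None  # (href, start) of the <a> currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._open = (dict(attrs).get("href"), self.length)

    def handle_data(self, data):
        self.parts.append(data)
        self.length += len(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._open is not None:
            href, start = self._open
            self.links.append((href, start, self.length))
            self._open = None


def link_contexts(html, window=1):
    """Return (href, context) pairs: the link's sentence plus `window`
    sentences before and after it."""
    parser = LinkContextParser()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)

    # Naive sentence spans: text up to the next '.', '!' or '?'.
    # Abbreviations and dotted names in the text will split too early.
    spans = [m.span() for m in re.finditer(r"[^.!?]+[.!?]*\s*", text)]

    results = []
    for href, start, _end in parser.links:
        # Index of the sentence that contains the start of the anchor text.
        idx = next((i for i, (s, e) in enumerate(spans) if s <= start < e), None)
        if idx is None:
            continue  # e.g. an empty link at the very end of the text
        first = spans[max(0, idx - window)][0]
        last = spans[min(len(spans) - 1, idx + window)][1]
        results.append((href, text[first:last].strip()))
    return results


# Hypothetical markup for the sample text from the question.
sample = (
    "<p>Nutch can run on a single machine, but gains a lot of its strength "
    "from running in a Hadoop cluster. You can download Nutch "
    '<a href="http://example.org/download">here</a>. For more information '
    "about Apache Nutch, please see the "
    '<a href="http://example.org/wiki">Nutch wiki</a>.</p>'
)

for href, context in link_contexts(sample):
    print(href, "->", context)
```

Because the parser records character offsets for every link, you also get "the position of the link within the fetched content" for free in `parser.links`. For real pages you would probably want a proper sentence tokenizer instead of the regex.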

On Hadoop, the best way to go is to implement a crawl using selectors:

  • Web crawl with Hadoop
  • HiveQL

Cascading can be used to handle the URLs you specify:

  • Hadoop and Cascading

Once you have the data, you can also use R to refine the analysis:

  • R and Hadoop
  • Enabling R on Hadoop

If you haven't done anything with Hadoop yet, here is a good starting point. You may also want to have a look at Hue Beeswax, an interactive tool that is very useful for data analysis.

Avanz answered Oct 18 '22 11:10
