
Distributed Web crawling using Apache Spark - Is it Possible?

An interesting question was asked of me during an interview about web mining: is it possible to crawl websites using Apache Spark?

I guessed that it was possible, because Spark supports distributed processing. After the interview I searched for this, but couldn't find any interesting answer. Is this possible with Spark?

New Man asked Apr 29 '15 17:04

People also ask

Which software is used for crawling the website?

UiPath is a robotic process automation tool for free web scraping. It automates web and desktop data crawling for most third-party apps, and runs on Windows. UiPath can extract tabular and pattern-based data across multiple web pages.

Is Web crawling allowed?

Web scraping and crawling aren't illegal by themselves; after all, you could scrape or crawl your own website without a hitch. Startups like it because it's a cheap and powerful way to gather data without the need for partnerships.

What are the methods of web crawling?

Here are the basic steps to build a crawler: Step 1: Add one or several URLs to be visited. Step 2: Pop a link from the URLs to be visited and add it to the visited-URLs list. Step 3: Fetch the page's content and scrape the data you're interested in, for example with the ScrapingBot API.
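The three steps above can be sketched as a plain breadth-first loop. In this sketch, `fetch_page` and `extract_links` are hypothetical stand-ins for a real HTTP client and HTML parser; they are passed in so the skeleton itself has no network dependency.

```python
from collections import deque

def crawl(seed_urls, fetch_page, extract_links, max_pages=100):
    """Breadth-first crawl: visit URLs, collect page content."""
    to_visit = deque(seed_urls)       # Step 1: URLs to be visited
    visited = set()
    results = {}
    while to_visit and len(visited) < max_pages:
        url = to_visit.popleft()      # Step 2: pop a link from the frontier...
        if url in visited:
            continue
        visited.add(url)              # ...and add it to the visited-URLs list
        content = fetch_page(url)     # Step 3: fetch and scrape the page
        results[url] = content
        for link in extract_links(content):
            if link not in visited:
                to_visit.append(link)
    return results
```

In a real crawler you would plug in something like `requests.get` for `fetch_page` and an HTML parser for `extract_links`, plus politeness delays and robots.txt handling.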

Is Web crawling the same as web scraping?

The short answer is that web scraping is about extracting data from one or more websites, while crawling is about finding or discovering URLs or links on the web.


1 Answer

Spark adds essentially no value to this task.

Sure, you can do distributed crawling, but good crawling tools already support this out of the box. The data structures provided by Spark, such as RDDs, are pretty much useless here, and just to launch crawl jobs, you could use YARN, Mesos, etc. directly at less overhead.
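To illustrate the point, here is a pseudocode sketch (not a recommended design, and not runnable as-is: `fetch` and `extract_links` are assumed helpers, and a Spark installation is required) of a breadth-first crawl forced into the RDD model. Note that each round still has to `collect()` the frontier back to the driver, because the crawl frontier is mutable shared state that RDDs do not model well:

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "naive-crawl")
frontier = ["https://example.com"]
visited = set()
for depth in range(3):
    # Distribute the current frontier and fetch each page in parallel.
    pages = sc.parallelize(frontier).map(lambda url: (url, fetch(url)))
    # Pull every discovered link back to the driver to build the next frontier.
    links = pages.flatMap(lambda kv: extract_links(kv[1])).distinct().collect()
    visited.update(frontier)
    frontier = [u for u in links if u not in visited]
```

All the coordination happens on the driver anyway, which is exactly what a dedicated distributed crawler already does with less overhead.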

Sure, you could do this on Spark, just like you could write a word processor on Spark, since it is Turing complete... but it doesn't get any easier.

Has QUIT--Anony-Mousse answered Sep 20 '22 20:09