I was asked an interesting question in an interview about web mining: is it possible to crawl websites using Apache Spark?
I guessed that it was, because of Spark's distributed processing capability. After the interview I searched for this, but couldn't find any convincing answer. Is it possible with Spark?
UiPath is robotic process automation software that can be used for free web scraping. It automates web and desktop data crawling for most third-party apps, and you can install it if you run Windows. UiPath is able to extract tabular and pattern-based data across multiple web pages.
Web scraping and crawling aren't illegal by themselves; after all, you could scrape or crawl your own website without a hitch. Startups love scraping because it's a cheap and powerful way to gather data without the need for partnerships.
Here are the basic steps to build a crawler:
Step 1: Add one or several seed URLs to the list of URLs to be visited.
Step 2: Pop a link from the to-visit list and add it to the set of visited URLs.
Step 3: Fetch the page's content and scrape the data you're interested in, for example with the ScrapingBot API. A sketch of these steps follows below.
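For concreteness, here is a minimal single-process sketch of those three steps in Python. It is an illustration under assumptions, not a production crawler: it uses the third-party requests library and the standard-library html.parser in place of the ScrapingBot API, and the seed URLs, page limit, and timeout are placeholders.

```python
import urllib.parse
from collections import deque
from html.parser import HTMLParser

import requests  # third-party: pip install requests


class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_urls, max_pages=50):
    to_visit = deque(seed_urls)   # Step 1: URLs to be visited
    visited = set()               # URLs already fetched
    while to_visit and len(visited) < max_pages:
        url = to_visit.popleft()  # Step 2: pop a link from the to-visit list
        if url in visited:
            continue
        visited.add(url)
        try:
            resp = requests.get(url, timeout=10)  # Step 3: fetch the page
        except requests.RequestException:
            continue  # skip unreachable pages
        parser = LinkExtractor()
        parser.feed(resp.text)
        for link in parser.links:
            absolute = urllib.parse.urljoin(url, link)
            if absolute not in visited:
                to_visit.append(absolute)
    return visited


print(crawl(["https://example.com"]))  # hypothetical seed URL
```

A real crawler would add politeness (robots.txt, rate limiting), deduplication beyond exact URL matching, and persistent storage, but the to-visit/visited structure above is the core of every design.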
The short answer is that web scraping is about extracting data from one or more websites, while crawling is about finding or discovering URLs or links on the web.
Spark adds essentially no value to this task.
Sure, you can do distributed crawling, but good crawling tools already support this out of the box. The data structures provided by Spark, such as RDDs, are pretty much useless here, and just to launch crawl jobs you could use YARN, Mesos, etc. directly with less overhead.
Sure, you could do this on Spark, just like you could write a word processor on Spark, since it is Turing complete... but it doesn't get any easier.
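To make that point concrete, here is a hedged sketch of what "crawling on Spark" might look like in PySpark: the URL list becomes an RDD and each fetch runs on an executor. The seed URLs are placeholders, requests must be installed on every worker, and nothing here beats a plain thread pool; Spark only adds cluster machinery around ordinary HTTP calls.

```python
from pyspark.sql import SparkSession

import requests  # must be installed on every executor


def fetch(url):
    """Download one page; return (url, page size) or (url, error message)."""
    try:
        resp = requests.get(url, timeout=10)
        return (url, len(resp.text))
    except requests.RequestException as exc:
        return (url, str(exc))


spark = SparkSession.builder.appName("naive-crawl").getOrCreate()

seed_urls = ["https://example.com", "https://example.org"]  # hypothetical seeds

# Distribute the URL list as an RDD and fetch each page on the executors.
results = spark.sparkContext.parallelize(seed_urls).map(fetch).collect()
for url, info in results:
    print(url, info)

spark.stop()
```

Note what is missing: a crawl frontier. Newly discovered links would have to be collected back to the driver and re-parallelized each round, which is exactly the bookkeeping a purpose-built crawler (Nutch, Scrapy with a distributed scheduler, etc.) already handles for you.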