Very simple question: in which cases should I prefer Hadoop MapReduce over Spark? (I hope this question has not been asked yet; at least I didn't find it...)
I am currently doing a comparison of those two processing frameworks, and from what I have read so far, everybody seems to suggest using Spark. Does that match your experience? Or can you name use cases where MapReduce performs better than Spark?
Would I need more resources (especially RAM) for the same task with Spark than I would need for MapReduce?
Thanks and regards!
Spark is a great improvement over traditional MapReduce.
When would you use MapReduce over Spark?
When you have a legacy program written in the MapReduce paradigm that is so complex that you do not want to reprogram it. Also, if your problem is not about analyzing data, then Spark might not be right for you. One example I can think of is web crawling: there is a great Apache project called Apache Nutch that is built on Hadoop, not Spark.
When would I use Spark over MapReduce?
Ever since 2012, when I started using Spark, I haven't wanted to go back. It has also been a great motivation to expand my knowledge beyond Java and to learn Scala. A lot of operations in Spark take fewer characters to write. Also, using the Scala REPL makes it much faster to produce working code. Hadoop has Pig, but then you have to learn "Pig Latin", which will never be useful anywhere else...
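To make the conciseness point concrete, here is the classic word count both ways, as a rough sketch in Python so the two are directly comparable (paths and file names are placeholders, not from the original answer). First, the MapReduce paradigm as a Hadoop Streaming job, split across two scripts:

```python
# mapper.py - emit "word<TAB>1" for every word on stdin
import sys

for line in sys.stdin:
    for word in line.split():
        print(word + "\t1")
```

```python
# reducer.py - Hadoop sorts mapper output by key, so each word's
# counts arrive together and can be summed in a single pass
import sys

current, total = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word != current:
        if current is not None:
            print(current + "\t" + str(total))
        current, total = word, 0
    total += int(count)
if current is not None:
    print(current + "\t" + str(total))
```

And the same job in Spark, which fits in a handful of lines (PySpark shown; the Scala version is nearly identical):

```python
# wordcount.py - word count with Spark's RDD API
from pyspark import SparkContext

sc = SparkContext(appName="wordcount")
counts = (sc.textFile("hdfs:///input/docs")        # read lines
            .flatMap(lambda line: line.split())    # split into words
            .map(lambda word: (word, 1))           # pair each word with 1
            .reduceByKey(lambda a, b: a + b))      # sum per word
counts.saveAsTextFile("hdfs:///output/wordcounts")
```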
If you want to use Python libraries in your data analysis, I find it easier to get Python working with Spark than with MapReduce. I also REALLY like using something like IPython Notebook. As much as Spark motivated me to learn Scala when I started, using IPython Notebook with Spark motivated me to learn PySpark. It doesn't have all the functionality of the Scala API, but most of the gap can be made up with Python packages.
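As a sketch of that notebook workflow (assuming a PySpark session that already provides a SparkContext named `sc`, and a made-up input path), the pattern is to do the heavy lifting in Spark and hand a small result back to ordinary Python libraries like pandas:

```python
import pandas as pd

# Distributed part: Spark computes the words-per-line distribution.
lengths = (sc.textFile("hdfs:///input/docs")
             .map(lambda line: len(line.split()))
             .countByValue())   # small dict {words_per_line: count}, fits in memory

# Local part: plain Python from here on (pandas, matplotlib, etc.).
df = pd.DataFrame(sorted(lengths.items()), columns=["words_per_line", "lines"])
print(df.head())
```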
Spark also now features Spark SQL, which is backward compatible with Hive. This lets you use Spark to run close-to-SQL queries. I think this is much better than trying to learn HiveQL, which is different enough that everything you learn is specific to it. With Spark SQL, you can usually get away with using general SQL advice to solve issues.
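A minimal Spark SQL sketch of that workflow: load data, register it as a table, and query it with plain SQL. The table, column, and path names are made up for illustration; current Spark exposes this through SparkSession, while the releases of that era used HiveContext/SQLContext for the same idea.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-demo").getOrCreate()

events = spark.read.json("hdfs:///input/events.json")  # placeholder path
events.createOrReplaceTempView("events")                # expose as a SQL table

top_users = spark.sql("""
    SELECT user_id, COUNT(*) AS n
    FROM events
    GROUP BY user_id
    ORDER BY n DESC
    LIMIT 10
""")
top_users.show()
```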
Lastly, Spark also has MLlib for machine learning, which is a great improvement over Apache Mahout.
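For a taste of MLlib, here is a minimal sketch using the DataFrame-based API (the original RDD-based API lives in pyspark.mllib); the toy data points are made up:

```python
from pyspark.sql import SparkSession
from pyspark.ml.clustering import KMeans
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("mllib-demo").getOrCreate()

# Four 2-D points forming two obvious clusters.
data = spark.createDataFrame(
    [(Vectors.dense([0.0, 0.0]),),
     (Vectors.dense([1.0, 1.0]),),
     (Vectors.dense([8.0, 9.0]),),
     (Vectors.dense([9.0, 8.0]),)],
    ["features"])

model = KMeans(k=2, seed=1).fit(data)
print(model.clusterCenters())  # two centers, near (0.5, 0.5) and (8.5, 8.5)
```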
Largest Spark issue: the internet is not yet full of troubleshooting tips. Since Spark is newer, documentation about common issues is still a little lacking... It's a good idea to buddy up with someone from the AMPLab/Databricks community (the creators of Spark at UC Berkeley and their consulting business, respectively) and use their forums for support.
You should prefer Hadoop MapReduce over Spark if your data is far larger than your cluster's memory: MapReduce streams everything to and from disk, so it degrades gracefully, while Spark's speed advantage depends on keeping working sets in RAM (which also answers the resources question above: for its best performance, Spark generally wants more RAM for the same task). On the other front, Spark's major use cases over Hadoop are iterative workloads such as machine learning and graph algorithms, interactive querying, and streaming, where reusing data in memory across steps pays off. Have a look at this blog and the DeZyre blog for detailed comparisons.