I am new to Spark and I wanted to ask for some common guidelines on developing and testing code for the Apache Spark framework.
What is the most common setup for testing my code locally? Is there a pre-built VM to spin up (a ready-made box, etc.)? Do I have to set up Spark locally? Is there a test library for testing my code?
When going to cluster mode, I notice there are several ways to set up a cluster; production-wise, what is the most common way to set up a cluster for running Spark? I see three options here.
Thank you
1) Common setup: Just download a Spark release to your local machine, unzip it, and follow these steps to set it up locally.
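Not part of the original answer, but here is a minimal sketch of what a quick local-mode check might look like once Spark is on the classpath (the object name, toy dataset and assertion are purely illustrative). Running with master "local[*]" uses all local cores inside a single JVM, so no cluster or VM is needed, and the same pattern can be reused inside a unit test.

```scala
// Minimal local-mode sanity check (illustrative names; assumes Spark 2.x+ on the classpath).
import org.apache.spark.sql.SparkSession

object LocalSparkCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("local-spark-check")
      .master("local[*]")              // local mode: all cores, no cluster manager required
      .getOrCreate()

    // Tiny word count against an in-memory dataset, followed by a simple assertion.
    val words  = spark.sparkContext.parallelize(Seq("spark", "test", "spark"))
    val counts = words.map(w => (w, 1)).reduceByKey(_ + _).collectAsMap()

    assert(counts("spark") == 2, "expected 'spark' to appear twice")
    spark.stop()
  }
}
```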
2) Launching a cluster for production: The Spark cluster mode overview available here explains the key concepts of running a Spark cluster. Spark can be run both in a standalone way and on several existing cluster managers. Currently, several deployment options are available:
Amazon EC2
Standalone mode
Apache Mesos
Hadoop YARN
The EC2 scripts let you launch a cluster in about 5 minutes. In fact, if you are using EC2, the best way to go is to use the scripts provided by Spark. Standalone mode is the best choice for deploying Spark on a private cluster.
Normally we use YARN as the cluster manager when there is an existing Hadoop setup with YARN, and the same goes for Mesos. If instead you are creating a new cluster from scratch, I would recommend using standalone mode, assuming you are not using Amazon EC2 instances. This link shows some steps that help with setting up a standalone Spark cluster.
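To make the standalone option a bit more concrete, here is a rough sketch (not from the original answer) of how the application side usually looks: the master URL is not hardcoded but supplied at submit time, and the spark://master-host:7077 address in the comment is only a placeholder for your own standalone master.

```scala
// Sketch of an application prepared for a standalone cluster (placeholder host name).
// The master URL is supplied at submit time rather than hardcoded, e.g.:
//   spark-submit --class ClusterWordCount --master spark://master-host:7077 app.jar <input-path>
import org.apache.spark.sql.SparkSession

object ClusterWordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("cluster-word-count")   // no .master() here; it comes from spark-submit
      .getOrCreate()

    // Word count over an input path passed as the first argument (e.g. an HDFS path).
    val lines  = spark.sparkContext.textFile(args(0))
    val counts = lines.flatMap(_.split("\\s+"))
      .map(w => (w, 1))
      .reduceByKey(_ + _)

    counts.take(10).foreach(println)
    spark.stop()
  }
}
```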
Hopefully the Sandbox from Hortonworks will help:
HDP 2.2.4 Sandbox with Apache Spark & Ambari Views: http://hortonworks.com/products/hortonworks-sandbox/#install
The second resource I'm using is http://www.cloudera.com/downloads/quickstart_vms/5-8.html
The image contains Hadoop, HBase, Impala, Spark and many more features. It does require 4 GB of RAM, 1 CPU and 62.5 GB of disk. That is fairly large, but it is free and fulfills all the requirements, unlike the paid versions in the cloud.