Why does Spark 1.6 not use Akka?


When I read the Spark 1.6 source code of the Master class, the receiveAndReply method does not seem to use Akka. [Cf. here.]

Why is it not using Akka? And what did they replace Akka with?

asked May 25 '16 by 高源伯

People also ask

Does spark use Akka?

Spark uses Akka basically for scheduling. After registering, each worker requests a task from the master, and the master just assigns it. Spark uses Akka for this messaging between the workers and the master.
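For readers unfamiliar with that pattern, here is a minimal sketch in Akka classic actors. The message and actor names (RegisterWorker, RequestTask, AssignTask) are illustrative, not Spark's actual protocol:

```scala
import akka.actor.{Actor, ActorRef, ActorSystem, Props}

// Hypothetical messages, mirroring the register-then-request flow described above.
case object RegisterWorker
case object RequestTask
case class AssignTask(taskId: Int)

class MasterActor extends Actor {
  private var nextTaskId = 0

  def receive: Receive = {
    case RegisterWorker =>
      println(s"registered worker: ${sender().path}")
    case RequestTask =>
      sender() ! AssignTask(nextTaskId) // the master just assigns the next task
      nextTaskId += 1
  }
}

class WorkerActor(master: ActorRef) extends Actor {
  master ! RegisterWorker // register on startup...
  master ! RequestTask    // ...then ask for work

  def receive: Receive = {
    case AssignTask(id) => println(s"worker running task $id")
  }
}

object Demo extends App {
  val system = ActorSystem("demo")
  val master = system.actorOf(Props(new MasterActor), "master")
  system.actorOf(Props(new WorkerActor(master)), "worker")
}
```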

How does Apache Spark work?

Apache Spark is an open-source, distributed processing system used for big data workloads. It utilizes in-memory caching and optimized query execution for fast analytic queries against data of any size.
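As a small illustration of the in-memory caching mentioned above (a toy local-mode job using the modern SparkSession API, not a benchmark): cache() keeps the dataset in memory after the first action, so later queries avoid recomputing it from the source.

```scala
import org.apache.spark.sql.SparkSession

object CacheDemo extends App {
  val spark = SparkSession.builder()
    .appName("cache-demo")
    .master("local[*]")
    .getOrCreate()

  val df = spark.range(0, 1000000).toDF("id")
  df.cache() // mark the dataset for in-memory storage

  println(df.count())                      // first action materializes the cache
  println(df.filter("id % 2 = 0").count()) // second query is served from memory

  spark.stop()
}
```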

When would you not want to use spark?

When would you not want to use Spark? For multi-user systems with shared memory, Hive may be a better choice². For real-time, low-latency processing, you may prefer Apache Kafka⁴. With small data sets, it's not going to give you huge gains, so you're probably better off with the typical libraries and tools.

What is Apache Spark and how does it work?

Spark has been called a "general purpose distributed data processing engine"¹ and "a lightning fast unified analytics engine for big data and machine learning"². It lets you process big data sets faster by splitting the work up into chunks and assigning those chunks across computational resources.
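A toy sketch of that chunking, using the RDD API: parallelize splits a collection into partitions, and each partition is processed in parallel.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ChunkDemo extends App {
  val sc = new SparkContext(
    new SparkConf().setAppName("chunk-demo").setMaster("local[4]"))

  // Split 1..100 into 8 chunks (partitions); each is processed in parallel.
  val rdd = sc.parallelize(1 to 100, numSlices = 8)
  println(rdd.getNumPartitions) // 8
  println(rdd.map(_ * 2).sum()) // 10100.0

  sc.stop()
}
```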

What are some common options to set in spark?

Some of the most common options to set are:

- The name of your application, which will appear in the UI and in log data.
- The number of cores to use for the driver process (cluster mode only).
- A limit on the total size of serialized results of all partitions for each Spark action (e.g. collect).
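Here is how those options might be set on a SparkConf. The keys are standard Spark configuration properties; the values are placeholders:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("my-application")            // shown in the UI and in log data
  .set("spark.driver.cores", "2")          // driver cores (cluster mode only)
  .set("spark.driver.maxResultSize", "1g") // cap on serialized results per action
```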


1 Answer

The motivation behind making Spark independent from Akka is well described in SPARK-5293, which is an umbrella task for Akka-related issues.

To quote the original description:

Spark depends on Akka, [so] it is not possible for users to rely on different versions, and we have received many requests in the past asking for help about this specific issue. For example, Spark Streaming might be used as the receiver of Akka messages - but our dependency on Akka requires the upstream Akka actors to also use the identical version of Akka.

Since our usage of Akka is limited (mainly for RPC and single-threaded event loop), we can replace it with alternative RPC implementations and a common event loop in Spark.

As you can see, the main reason is simple - to give users more flexibility in creating their own applications.
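As for what Akka was replaced with: starting with Spark 1.6 the internal RPC layer is pluggable (the spark.rpc setting), with a Netty-based implementation as the default, and it exposes the receiveAndReply style you saw in the Master class. Below is a simplified, self-contained sketch of that pattern; the real RpcEndpoint and RpcCallContext live in org.apache.spark.rpc and are internal to Spark, so the types here are stand-ins:

```scala
// Stand-in for Spark's internal RpcCallContext: carries the reply channel.
trait RpcCallContext {
  def reply(response: Any): Unit
}

// Stand-in for Spark's internal RpcEndpoint: messages that expect an answer
// go through receiveAndReply instead of an Akka actor's receive.
trait RpcEndpoint {
  def receiveAndReply(context: RpcCallContext): PartialFunction[Any, Unit]
}

case class RegisterWorker(id: String)
case class RegisteredWorker(masterUrl: String)

class ToyMaster extends RpcEndpoint {
  override def receiveAndReply(context: RpcCallContext): PartialFunction[Any, Unit] = {
    case RegisterWorker(id) =>
      // The reply travels back through the RPC layer (Netty by default in 1.6),
      // with no Akka dependency leaking into user applications.
      context.reply(RegisteredWorker("spark://master:7077"))
  }
}
```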

Removing a complex dependency like Akka, which Spark wasn't using extensively anyway, also lowers the cost of maintenance.

answered Oct 31 '22 by zero323