Apache Spark: Driver (instead of just the Executors) tries to connect to Cassandra

I guess I'm not yet fully understanding how Spark works.

Here is my setup:

I'm running a Spark cluster in Standalone mode. I'm using 4 machines for this: One is the Master, the other three are Workers.

I have written an application that reads data from a Cassandra cluster (see https://github.com/journeymonitor/analyze/blob/master/spark/src/main/scala/SparkApp.scala#L118).

The 3-node Cassandra cluster runs on the same machines that host the Spark Worker nodes. The Spark Master machine does not run a Cassandra node:

Machine 1      Machine 2        Machine 3        Machine 4
Spark Master   Spark Worker     Spark Worker     Spark Worker
               Cassandra node   Cassandra node   Cassandra node

The reasoning behind this is that I want to optimize data locality - when running my Spark app on the cluster, each Worker only needs to talk to its local Cassandra node.

Now, when submitting my Spark app to the cluster by running spark-submit --deploy-mode client --master spark://machine-1 from Machine 1 (the Spark Master), I expect the following:

1. A Driver instance is started on the Spark Master.
2. The Driver starts one Executor on each Spark Worker.
3. The Driver distributes my application to each Executor.
4. My application runs on each Executor and, from there, talks to Cassandra via 127.0.0.1:9042.

However, this doesn't seem to be the case. Instead, the Spark Master tries to talk to Cassandra (and fails, because there is no Cassandra node on the Machine 1 host).

What is it that I misunderstand? Does it work differently? Does the Driver in fact read the data from Cassandra and distribute it to the Executors? But then I could never read data larger than the memory of Machine 1, even if the total memory of my cluster is sufficient.

Or does the Driver talk to Cassandra not to read data, but to find out how to partition the data, and then instruct the Executors to read "their" part of the data?

If someone can enlighten me, that would be much appreciated.

asked Nov 24 '15 15:11

Manuel Kießling

1 Answer

The driver program is responsible for creating the SparkContext and SQLContext and for scheduling tasks on the worker nodes. This includes creating logical and physical plans and applying optimizations. To do that, it needs access to the data source's schema and possibly other information such as statistics. Implementation details vary from source to source, but generally speaking the data source should be reachable from all nodes, including the one that runs the driver.
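For the setup in the question, one way to make Cassandra reachable from the driver as well is to stop using 127.0.0.1 and give the connector contact points that resolve to the same nodes from every machine. The sketch below is plain Scala with no Spark dependency; it only assembles the settings. The property names spark.cassandra.connection.host and spark.cassandra.connection.port come from the Spark Cassandra Connector, but the host names machine-2 to machine-4 are assumptions (the question only names machine-1), and the helper asSubmitArgs is made up for illustration.

```scala
// Hypothetical host names: the question only names machine-1;
// machine-2, machine-3 and machine-4 are assumed here.
object CassandraContactPoints {
  // Contact points that mean the same thing on every machine, unlike 127.0.0.1,
  // which points at a different machine depending on who resolves it.
  val settings: Map[String, String] = Map(
    "spark.cassandra.connection.host" -> "machine-2,machine-3,machine-4",
    "spark.cassandra.connection.port" -> "9042"
  )

  // Illustrative helper: render the settings as spark-submit arguments.
  def asSubmitArgs(conf: Map[String, String]): Seq[String] =
    conf.toSeq.sortBy(_._1).flatMap { case (k, v) => Seq("--conf", s"$k=$v") }

  def main(args: Array[String]): Unit =
    println(asSubmitArgs(settings).mkString(" "))
}
```

The resulting arguments can be appended to the spark-submit call from the question, or the same key/value pairs can be set on the SparkConf inside the application. As far as I know the connector uses these hosts only as contact points to discover the cluster, so this should not by itself give up the locality you are after, but I have not verified that against this particular deployment.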

At the end of the day, your expectations are almost correct. Chunks of the data are fetched individually on each worker without going through the driver program, but the driver has to be able to connect to Cassandra to fetch the required metadata.
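To make that split concrete, here is a toy model in plain Scala. It is not Spark or connector code: TokenRange, Task, FakeCassandra, planOnDriver and runOnExecutor are all hypothetical names, and the host names are assumed. It only shows that the planning step touches metadata, while the rows are read per task.

```scala
// Toy model only: these types are hypothetical and are not Spark or connector APIs.
object DriverVsExecutors {
  // Metadata the driver needs: which token range lives on which Cassandra node.
  final case class TokenRange(start: Long, end: Long, replicaHost: String)

  // A task the driver schedules; it carries a range description, not the rows.
  final case class Task(range: TokenRange, preferredHost: String)

  // Stand-in for a Cassandra cluster: rows keyed by token, plus ring metadata.
  final class FakeCassandra(rows: Map[Long, String], val ring: Seq[TokenRange]) {
    def read(range: TokenRange): Seq[String] =
      rows.toSeq.sortBy(_._1).collect {
        case (token, row) if token >= range.start && token < range.end => row
      }
  }

  // Driver side: look at metadata only and create one task per token range.
  def planOnDriver(cassandra: FakeCassandra): Seq[Task] =
    cassandra.ring.map(r => Task(r, preferredHost = r.replicaHost))

  // Executor side: read only the rows of the assigned range.
  def runOnExecutor(task: Task, cassandra: FakeCassandra): Seq[String] =
    cassandra.read(task.range)

  def main(args: Array[String]): Unit = {
    val cassandra = new FakeCassandra(
      rows = Map(5L -> "a", 15L -> "b", 25L -> "c"),
      ring = Seq(
        TokenRange(0L, 10L, "machine-2"),
        TokenRange(10L, 20L, "machine-3"),
        TokenRange(20L, 30L, "machine-4")
      )
    )
    planOnDriver(cassandra).foreach { t =>
      println(s"${t.preferredHost} reads ${runOnExecutor(t, cassandra).mkString(",")}")
    }
  }
}
```

Note that planOnDriver never calls read, so in this model the driver's memory is not a limit on the amount of data; it still needs a working connection to obtain the ring information, which is the part that fails when the driver looks for Cassandra on 127.0.0.1 of Machine 1.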

answered Oct 17 '22 01:10

zero323

Tags: scala, cassandra, apache-spark
