Is there a way to set the preferred locations of RDD partitions manually? I want to make sure certain partitions are computed on a certain machine.
I'm using an array and the parallelize method to create an RDD from it.
Also, I'm not using HDFS; the files are on the local disk. That's why I want to control the execution node.
The coalesce() and repartition() transformations are used for changing the number of partitions of an RDD; repartition() simply calls coalesce() with shuffling enabled.
The loaded RDD is partitioned by the default (hash-based) partitioner. To specify a custom partitioner, you can call rdd.partitionBy(), providing your own Partitioner.
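A minimal sketch of the difference (assuming sc is a SparkContext, e.g. from spark-shell; rdd is a name made up for illustration):

```scala
val rdd = sc.parallelize(1 to 100, 8)  // start with 8 partitions

// coalesce() narrows the partition count without a shuffle
// (fast, but the resulting partitions may be unbalanced)
val fewer = rdd.coalesce(2)

// repartition(n) is coalesce(n, shuffle = true): it performs a full
// shuffle, which is also why it can *increase* the partition count
val more = rdd.repartition(16)
```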
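A minimal sketch of a custom partitioner (the ParityPartitioner below is hypothetical; note that partitionBy is only defined on RDDs of key-value pairs):

```scala
import org.apache.spark.Partitioner

// Hypothetical partitioner: even keys go to partition 0, odd keys to partition 1.
class ParityPartitioner extends Partitioner {
  override def numPartitions: Int = 2
  override def getPartition(key: Any): Int = key.asInstanceOf[Int] % 2
}

val pairs = sc.parallelize(Seq((1, "a"), (2, "b"), (3, "c")))
val custom = pairs.partitionBy(new ParityPartitioner)
```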
By default, Spark creates one partition for each block of the file (blocks being 128MB by default in HDFS), but you can also ask for a higher number of partitions by passing a larger value. Note that you cannot have fewer partitions than blocks.
In Spark, the number of partitions of an RDD can always be inspected with its partitions method; for example, for an RDD created with 6 partitions, it will report 6.
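For example (the file path is hypothetical; the second argument of textFile is a minimum number of partitions):

```scala
// Ask for more partitions than the default when loading a file.
val lines = sc.textFile("data/input.txt", 10)

// Inspect how many partitions an RDD actually has.
println(lines.partitions.length)
println(sc.parallelize(1 to 100, 6).partitions.length)  // 6
```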
Is there a way to set the preferredLocations of RDD partitions manually?
Yes, there is, but it's RDD-specific and so different kinds of RDDs have different ways to do it.
Spark uses RDD.preferredLocations to get a list of preferred locations to compute each partition/split on (e.g. block locations for an HDFS file).
final def preferredLocations(split: Partition): Seq[String]
Get the preferred locations of a partition, taking into account whether the RDD is checkpointed.
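For example, you can query these preferences from the driver for every partition of an RDD (the path below is hypothetical; for a parallelized collection the result is typically empty):

```scala
val rdd = sc.textFile("hdfs:///some/file.txt")
rdd.partitions.foreach { p =>
  println(s"partition ${p.index} -> ${rdd.preferredLocations(p)}")
}
```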
Note that the method is final, which means that no one can ever override it.
When you look at the source code of RDD.preferredLocations, you will see how an RDD knows its preferred locations: it uses the protected RDD.getPreferredLocations method, which a custom RDD may (but does not have to) override to specify placement preferences.
protected def getPreferredLocations(split: Partition): Seq[String] = Nil
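To illustrate, a minimal sketch of a custom RDD that pins each partition to a host; PinnedRDD and PinnedPartition are names made up for this example, not Spark API:

```scala
import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// A hypothetical partition that remembers the host it should run on.
class PinnedPartition(val index: Int, val host: String) extends Partition

// A toy RDD whose partitions each carry a placement preference.
class PinnedRDD(sc: SparkContext, hosts: Seq[String])
    extends RDD[Int](sc, Nil) {

  override protected def getPartitions: Array[Partition] =
    hosts.zipWithIndex
      .map { case (h, i) => new PinnedPartition(i, h) }
      .toArray

  // The hook Spark's scheduler consults for task placement.
  override protected def getPreferredLocations(split: Partition): Seq[String] =
    Seq(split.asInstanceOf[PinnedPartition].host)

  override def compute(split: Partition, context: TaskContext): Iterator[Int] =
    Iterator(split.index)  // trivial payload, one element per partition
}
```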
So the question has now "morphed" into another one: which RDDs allow setting their preferred locations? Find yours and check its source code.
I'm using an array and the 'Parallelize' method to create a RDD from that.
If you parallelize your local dataset, it becomes a distributed one, but...why would you want to use Spark for something you can process locally on a single computer/node?
If, however, you insist and really do want to use Spark for local datasets, the RDD behind SparkContext.parallelize is...let's have a look at the source code...ParallelCollectionRDD, which does allow for location preferences.
Let's then rephrase your question to the following (hoping I won't lose any important fact): What are the operators that allow for creating a ParallelCollectionRDD and specifying the location preferences explicitly?
To my great surprise (as I didn't know about the feature), there is such an operator, i.e. SparkContext.makeRDD, that...accepts one or more location preferences (hostnames of Spark nodes) for each object.
makeRDD[T](seq: Seq[(T, Seq[String])]): RDD[T]
Distribute a local Scala collection to form an RDD, with one or more location preferences (hostnames of Spark nodes) for each object. Create a new partition for each collection item.
In other words, rather than using parallelize, you have to use makeRDD (which is available in the Spark Core API for Scala; I'm not sure about Python, which I'm leaving as a home exercise for you :)).
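For example, a sketch (the hostnames are placeholders for your own workers):

```scala
// Each element is paired with the hostnames of the Spark nodes
// where its partition should preferably be computed.
val data: Seq[(String, Seq[String])] = Seq(
  ("record-1", Seq("host-a")),
  ("record-2", Seq("host-b")),
  ("record-3", Seq("host-a", "host-b"))  // more than one preference is allowed
)

val rdd = sc.makeRDD(data)  // one partition per collection item
rdd.partitions.foreach(p => println(rdd.preferredLocations(p)))
```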
The same reasoning applies to any other RDD operator/transformation that creates some sort of RDD.