I'd like to avoid repartitioning data set by key as much as possible and know if all records for a given key are in the same partition already.
Is there a built-in function in Spark that would give me the answer?
Not built-in, but if you assume a specific partitioner it is easy enough to implement your own function:
import org.apache.spark.rdd.RDD
import org.apache.spark.Partitioner
import scala.reflect.ClassTag

def checkDistribution[K : ClassTag, V : ClassTag](
    rdd: RDD[(K, V)], partitioner: Partitioner) =
  // If a partitioner is set we compare partitioners
  rdd.partitioner.map(_ == partitioner).getOrElse {
    // Otherwise check if the number of partitions is correct
    rdd.partitions.size == partitioner.numPartitions &&
    // and if the distribution of keys matches the partitioner
    rdd.keys.mapPartitionsWithIndex((i, iter) =>
      Iterator(iter.forall(x => partitioner.getPartition(x) == i))
    ).fold(true)(_ && _)
  }
A few tests:
import org.apache.spark.HashPartitioner
val rdd = sc.range(0, 20, 5).map((_, None))
Not partitioned, invalid distribution:
checkDistribution(rdd, new HashPartitioner(10))
Boolean = false
Partitioned, invalid partitioner:
checkDistribution(
  rdd.partitionBy(new HashPartitioner(5)),
  new HashPartitioner(10)
)
Boolean = false
Partitioned, valid partitioner:
checkDistribution(
  rdd.partitionBy(new HashPartitioner(10)),
  new HashPartitioner(10)
)
Boolean = true
Not partitioned, valid distribution:
checkDistribution(
  rdd.partitionBy(new HashPartitioner(10)).map(identity),
  new HashPartitioner(10)
)
Boolean = true
Without assuming a particular partitioner, the only option that comes to mind requires a shuffle, so it is unlikely to be an improvement.
def checkDistribution[K : ClassTag, V : ClassTag](rdd: RDD[(K, V)]) =
  rdd.keys.mapPartitionsWithIndex((i, iter) => iter.map((_, i)))
    .combineByKey(
      x => Seq(x),
      (x: Seq[Int], y: Int) => x,
      (x: Seq[Int], y: Seq[Int]) => x ++ y)  // Should be more or less OK
    .values
    .mapPartitions(iter => Iterator(iter.forall(_.size == 1)))
    .fold(true)(_ && _)
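For illustration, here is how this shuffle-based check behaves on the rdd from the tests above (the second outcome depends on how repartition happens to scatter the duplicate keys, so treat it as a sketch):

// every key ends up in exactly one partition, so this returns true
checkDistribution(rdd.partitionBy(new HashPartitioner(10)))

// after a blind repartition the duplicates of a key will typically be split, so this is likely false
checkDistribution(rdd.union(rdd).repartition(3))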
One possible improvement is that you can use the same logic to automatically define a Partitioner for the data: if you collectAsMap before values and check that all Seqs are of size 1, you have a valid partitioner which guarantees no network traffic.
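As a rough sketch of that idea (MapPartitioner and partitionerFor are names made up here, not Spark classes; it assumes the set of distinct keys is small enough to collect to the driver):

import org.apache.spark.rdd.RDD
import org.apache.spark.Partitioner
import scala.reflect.ClassTag

// Hypothetical partitioner that simply replays an observed key -> partition-id assignment
class MapPartitioner[K](assignments: Map[K, Int], override val numPartitions: Int)
    extends Partitioner {
  def getPartition(key: Any): Int =
    assignments.getOrElse(key.asInstanceOf[K], 0)  // arbitrary fallback for unseen keys
}

def partitionerFor[K : ClassTag, V : ClassTag](rdd: RDD[(K, V)]): Option[Partitioner] = {
  // Same logic as above, but collected as a map before taking .values
  val assignments = rdd.keys
    .mapPartitionsWithIndex((i, iter) => iter.map((_, i)))
    .combineByKey(
      (x: Int) => Seq(x),
      (x: Seq[Int], y: Int) => x,
      (x: Seq[Int], y: Seq[Int]) => x ++ y)
    .collectAsMap()

  // Valid only if every key lives in exactly one partition
  if (assignments.values.forall(_.size == 1))
    Some(new MapPartitioner(assignments.mapValues(_.head).toMap, rdd.partitions.size))
  else
    None
}

Using the resulting partitioner with partitionBy keeps every record in the partition it already occupies, which is what backs the "no network traffic" claim above.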
Not 100% what you requested, but you can check this using spark_partition_id. Basically do:
withColumn("pid", spark_partition_id())
and then do:
df.groupBy(<the columns you want to check>).agg(max($"pid").as("pidmax"), min($"pid").as("pidmin")).filter($"pidmax" =!= $"pidmin").count()
The count gives you how many keys are not confined to a single partition (0 means every key is already co-located). Note that this is relatively low cost, being a simple aggregation.
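Spelled out in Scala (the key column is a placeholder for whatever you want to check, and a SparkSession named spark is assumed for the $ syntax):

import org.apache.spark.sql.functions.{spark_partition_id, max, min}
import spark.implicits._  // assuming a SparkSession named spark

// Tag every row with the id of the partition it currently lives in
val tagged = df.withColumn("pid", spark_partition_id())

// Count keys whose rows are spread over more than one partition
val splitKeys = tagged
  .groupBy($"key")  // placeholder column
  .agg(max($"pid").as("pidmax"), min($"pid").as("pidmin"))
  .filter($"pidmax" =!= $"pidmin")
  .count()

// splitKeys == 0 means every key is already confined to a single partition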
I don't believe there is a generic way, because if we read from a generic source (e.g. a file), we don't necessarily know how the source was originally partitioned.
It would be nice if there were something like "get current partitioner" which would return explicit partitioners (e.g. from an explicit repartition command, or from reading Parquet that was written using partitionBy) as an approximation, though.
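For RDDs the closest thing that does exist is the partitioner field, which is an Option and only survives operations known to preserve the layout; a small sketch, reusing the rdd from the tests above:

import org.apache.spark.HashPartitioner

val byKey = rdd.partitionBy(new HashPartitioner(10))

byKey.partitioner                     // Some(org.apache.spark.HashPartitioner@...)
byKey.mapValues(_ => 1).partitioner   // still Some(...): mapValues preserves the partitioner
byKey.map(identity).partitioner       // None: map may change keys, so Spark drops it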