
Is a Spark RDD deterministic for the set of elements in each partition?

I can't find much documentation on guarantees about partitioning order. I just want to ensure that, given a set of deterministic transformations (output rows always the same), partitions always receive the same set of elements as long as the underlying dataset doesn't change. Is that possible?

The partitions don't need to be sorted. As an example: after a set of transformations is applied to an RDD, it now looks like this -> (A, B, C, D, E, F, G)

And if my spark.default.parallelism were 2 or 3, the sets of elements would always be either (A, B, C, D), (E, F, G) or (A, B), (C, D), (E, F, G), respectively.
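To make the example concrete, here is one way to inspect which elements land in which partition (a sketch, assuming a live SparkContext sc; the exact split depends on how the RDD was built):

rdd = sc.parallelize(["A", "B", "C", "D", "E", "F", "G"], 2)
# glom() turns each partition into a list of its elements
print(rdd.glom().collect())  # e.g. [['A', 'B', 'C'], ['D', 'E', 'F', 'G']]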

I ask because my executors will be causing side effects based on the partition/set of elements they operate on, and I want to make sure that the Spark application is idempotent (same side effects if it restarts).

Edit: Apparently, DataFrame repartition is deterministic but RDD repartition is not (Spark 2.4.4):

# Runs once per partition via mapPartitions; summarizes the partition so
# its contents can be compared across runs. `analysis_date` comes from
# the surrounding driver scope.
def f1(rdds):
    rows = list(rdds)  # rdds is the iterator over this partition's rows
    stats_summary = [{
        'origin': str(row['origin']),
        'dest': str(row['dest']),
        'start_time': analysis_date.isoformat(),
        'value': row['count']
    } for row in rows]

    # Sort so that "first" and "last" are stable within a partition
    stats_summary.sort(key=lambda t: (t['start_time'], t['origin'], t['dest']))

    rtn = "partition size: {}, first: ({}, {}), last: ({}, {})".format(
        len(rows),
        stats_summary[0]["origin"], stats_summary[0]["dest"],
        stats_summary[-1]["origin"], stats_summary[-1]["dest"])
    return [rtn]

# RDD-level repartition: the partition summaries differ from run to run
repartition_rdd_res = unq_statistics.rdd \
                                    .repartition(10) \
                                    .mapPartitions(f1) \
                                    .collect()

# DataFrame-level repartition: the partition summaries come out stable
repartition_df_res = unq_statistics.repartition(10) \
                                   .rdd \
                                   .mapPartitions(f1) \
                                   .collect()

repartition_rdd_res4 = ['partition size: 131200, first: (-1, -1), last: (999, -1)',
 'partition size: 131209, first: (-1, 1014), last: (996, 996)',
 'partition size: 131216, first: (-1, 1021), last: (999, 667)',
 'partition size: 131218, first: (-1, 1008), last: (991, 1240)',
 'partition size: 131222, first: (-1, 1001), last: (994, 992)',
 'partition size: 131229, first: (-1, 1007), last: (994, 890)',
 'partition size: 131233, first: (-1, 1004), last: (991, -1)',
 'partition size: 131235, first: (-1, 1005), last: (999, 1197)',
 'partition size: 131237, first: (-1, 100), last: (999, 997)',
 'partition size: 131240, first: (-1, 1010), last: (994, -1)']

repartition_rdd_res3 = ['partition size: 131200, first: (-1, -1), last: (999, -1)',
 'partition size: 131209, first: (-1, 1006), last: (994, 2048)',
 'partition size: 131216, first: (-1, 1002), last: (996, 996)',
 'partition size: 131218, first: (-1, 1017), last: (999, 667)',
 'partition size: 131222, first: (-1, 1008), last: (994, 890)',
 'partition size: 131229, first: (-1, 1000), last: (99, 96)',
 'partition size: 131233, first: (-1, 1001), last: (994, 992)',
 'partition size: 131235, first: (-1, 1009), last: (990, 1601)',
 'partition size: 131237, first: (-1, 1004), last: (994, -1)',
 'partition size: 131240, first: (-1, 1003), last: (999, 997)']

repartition_rdd_res2 = ['partition size: 131200, first: (-1, 1013), last: (991, 2248)',
 'partition size: 131209, first: (-1, 1007), last: (999, 667)',
 'partition size: 131216, first: (-1, 100), last: (99, 963)',
 'partition size: 131218, first: (-1, 1002), last: (999, 997)',
 'partition size: 131222, first: (-1, 101), last: (996, 996)',
 'partition size: 131229, first: (-1, -1), last: (991, 1240)',
 'partition size: 131233, first: (-1, 1006), last: (999, 1197)',
 'partition size: 131235, first: (-1, 1001), last: (994, 992)',
 'partition size: 131237, first: (-1, 1019), last: (999, -1)',
 'partition size: 131240, first: (-1, 1017), last: (991, -1)']

repartition_df_res2 = ['partition size: 131222, first: (-1, 1023), last: (996, 996)',
 'partition size: 131223, first: (-1, 1003), last: (999, 667)',
 'partition size: 131223, first: (-1, 1012), last: (990, 990)',
 'partition size: 131224, first: (-1, -1), last: (999, 1558)',
 'partition size: 131224, first: (-1, 100), last: (99, 98)',
 'partition size: 131224, first: (-1, 1008), last: (99, 968)',
 'partition size: 131224, first: (-1, 1018), last: (999, 997)',
 'partition size: 131225, first: (-1, 1006), last: (994, 992)',
 'partition size: 131225, first: (-1, 101), last: (990, 935)',
 'partition size: 131225, first: (-1, 1013), last: (999, 1197)']
asked Jan 10 '20 by KRC



2 Answers

Let's look at the source (RDD.repartition delegates to coalesce with shuffle = true), and specifically its shuffle branch:

...
if (shuffle) {
  /** Distributes elements evenly across output partitions, starting from a random partition. */
  val distributePartition = (index: Int, items: Iterator[T]) => {
    var position = new Random(hashing.byteswap32(index)).nextInt(numPartitions)
    items.map { t =>
      // Note that the hash code of the key will just be the key itself. The HashPartitioner
      // will mod it with the number of total partitions.
      position = position + 1
      (position, t)
    }
  } : Iterator[(Int, T)]
  ...

As you can see, the distribution of elements from a given source partition N into X target partitions is a simple increment (later taken modulo X by the HashPartitioner), starting from a number that depends only on N and is hence predetermined. So if your source RDD is unchanged, the result of repartition(X) should be the same every time as well.
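For intuition, here is a minimal Python re-creation of that logic (the name simulate_distribute and the use of Python's seeded random.Random as a stand-in for Scala's Random(byteswap32(index)) are mine, not Spark's):

import random

def simulate_distribute(source_partitions, num_partitions):
    # Mimics distributePartition: each source partition index seeds a
    # starting target position; its items are then assigned round-robin.
    targets = [[] for _ in range(num_partitions)]
    for index, items in enumerate(source_partitions):
        # deterministic per source-partition index
        position = random.Random(index).randrange(num_partitions)
        for item in items:
            position += 1  # the simple increment...
            targets[position % num_partitions].append(item)  # ...mod'ed by X
    return targets

# Same input, same output, every time:
assert simulate_distribute([["A", "B"], ["C", "D", "E"]], 2) == \
       simulate_distribute([["A", "B"], ["C", "D", "E"]], 2)

So as long as the source partitions hold the same elements in the same order, every run produces the same target partitions.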

answered Oct 27 '22 by mazaneicha


Internally, Spark uses a default partitioner (HashPartitioner, depending on the data) to partition the data; it uses a hash to identify which partition an item belongs to. Thus a data item will always go to the same partition, given that the partition count stays the same; if the partition count changes, it will affect the hash assignment as well.
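As a rough illustration (hash_partition is an invented name, not a Spark API), the assignment rule looks like this:

def hash_partition(key, num_partitions):
    # HashPartitioner semantics: the key's hash code modulo the partition
    # count, kept non-negative (Spark's Scala code uses Utils.nonNegativeMod;
    # Python's % already yields a non-negative result for a positive divisor).
    return hash(key) % num_partitions

One caveat if you reason about this from Python: the built-in hash() for strings is salted per process unless PYTHONHASHSEED is set, whereas a JVM hashCode is stable across runs.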

answered Oct 26 '22 by Waqar Ahmed