Which operations preserve RDD order?

Tags:

apache-spark

rdd

An RDD has a meaningful order (as opposed to some random order imposed by the storage model) if it was processed by sortBy(), as explained in this reply.

Now, which operations preserve that order?

E.g., is it guaranteed that (after a.sortBy())

a.map(f).zip(a) ===  a.map(x => (f(x),x))

How about

a.filter(f).map(g) === a.map(x => (x,g(x))).filter(p => f(p._1)).map(_._2)

what about

a.filter(f).flatMap(g) === a.flatMap(x => g(x).map((x,_))).filter(p => f(p._1)).map(_._2)

Here "equality" === is understood as "functional equivalence": there is no way to distinguish the outcomes using user-level operations (that is, without reading logs etc.).


asked Mar 26 '15 16:03

sds

1 Answer

All operations preserve the order, except those that explicitly do not. Ordering is always "meaningful", not just after a sortBy. For example, if you read a file (sc.textFile) the lines of the RDD will be in the order that they were in the file.

Without trying to give a complete list, map, filter and flatMap do preserve the order. sortBy, partitionBy, join do not preserve the order.
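As a rough check of the three identities from the question, the sketch below compares both sides on a small sorted RDD in local mode. It assumes Spark is on the classpath; the functions h, keep and expand are hypothetical stand-ins for the question's f and g, and passing on one small input is evidence, not a proof of a guarantee.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object OrderCheck {
  // Hypothetical stand-ins for the question's f and g.
  val h: Int => Int = _ * 10
  val keep: Int => Boolean = _ % 2 == 0
  val expand: Int => Seq[Int] = x => Seq(x, -x)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[4]").setAppName("order-check")
    val sc = new SparkContext(conf)
    try {
      // A small RDD spread over 4 partitions, then globally sorted.
      val a = sc.parallelize(Seq(5, 3, 8, 1, 9, 2, 7, 4), numSlices = 4).sortBy(x => x)

      // Identity 1: zipping a mapped RDD with its source pairs each element with its image.
      val lhs1 = a.map(h).zip(a).collect().toSeq
      val rhs1 = a.map(x => (h(x), x)).collect().toSeq
      assert(lhs1 == rhs1)

      // Identity 2: filter then map, versus map, filter on the original value, project.
      val lhs2 = a.filter(keep).map(h).collect().toSeq
      val rhs2 = a.map(x => (x, h(x))).filter(p => keep(p._1)).map(_._2).collect().toSeq
      assert(lhs2 == rhs2)

      // Identity 3: the same comparison with flatMap.
      val lhs3 = a.filter(keep).flatMap(expand).collect().toSeq
      val rhs3 = a.flatMap(x => expand(x).map(y => (x, y)))
        .filter(p => keep(p._1)).map(_._2).collect().toSeq
      assert(lhs3 == rhs3)
    } finally {
      sc.stop()
    }
  }
}
```

Note that collect() returns an Array, so the results are converted with toSeq before comparing; == on two arrays would compare references.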

The reason is that most RDD operations work on Iterators inside the partitions. So map or filter just has no way to mess up the order. You can take a look at the code to see for yourself.
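To make the iterator argument concrete, here is a toy model in plain Scala with no Spark dependency. It is only an illustration, not Spark's implementation: the names PartitionModel, Partitions and the helper functions are made up for this sketch, and an "RDD" is modelled as an ordered vector of partitions.

```scala
object PartitionModel {
  // Toy stand-in for an RDD: partitions in order, each holding elements in order.
  type Partitions[T] = Vector[Vector[T]]

  // Apply an iterator-to-iterator function to each partition independently.
  // Elements never move between partitions and are consumed in their existing order.
  def mapPartitions[T, U](ps: Partitions[T])(f: Iterator[T] => Iterator[U]): Partitions[U] =
    ps.map(part => f(part.iterator).toVector)

  def map[T, U](ps: Partitions[T])(f: T => U): Partitions[U] =
    mapPartitions(ps)(_.map(f))

  def filter[T](ps: Partitions[T])(pred: T => Boolean): Partitions[T] =
    mapPartitions(ps)(_.filter(pred))

  def flatMap[T, U](ps: Partitions[T])(f: T => Iterable[U]): Partitions[U] =
    mapPartitions(ps)(_.flatMap(f))

  // Concatenate the partitions in partition order, like collecting to the driver.
  def collect[T](ps: Partitions[T]): Vector[T] = ps.flatten

  def main(args: Array[String]): Unit = {
    val ps: Partitions[Int] = Vector(Vector(1, 2, 3), Vector(4, 5), Vector(6, 7, 8))
    // Each result keeps the relative order of the input elements.
    println(collect(filter(ps)(_ % 2 == 0)))
    println(collect(map(ps)(_ * 10)))
    println(collect(flatMap(ps)(x => Seq(x, -x))))
  }
}
```

Because every operation here is expressed through the per-partition iterator, none of them has a place where elements could be reordered.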

You may now ask: what if I have an RDD with a HashPartitioner, and I use map to change the keys? Well, the elements will stay in place, and now the RDD is no longer partitioned by the key. You can use partitionBy to restore the partitioning with a shuffle.
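The partitioner behaviour can be observed through the RDD's partitioner field. The following is a minimal sketch, assuming Spark is on the classpath and running in local mode; the sample data and the object name PartitionerDemo are made up, and cache() is used here only so that both glom() calls see the same materialized layout.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object PartitionerDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("partitioner-demo")
    val sc = new SparkContext(conf)
    try {
      val byKey = sc.parallelize(Seq(1 -> "a", 2 -> "b", 3 -> "c", 4 -> "d"))
        .partitionBy(new HashPartitioner(2))
        .cache()
      assert(byKey.partitioner.isDefined)

      // map could change the keys, so the result no longer carries a partitioner.
      val rekeyed = byKey.map { case (k, v) => (k + 1, v) }
      assert(rekeyed.partitioner.isEmpty)

      // glom() exposes each partition as an array: the values sit where they were.
      val before = byKey.glom().collect().map(_.map(_._2).toSeq).toSeq
      val after = rekeyed.glom().collect().map(_.map(_._2).toSeq).toSeq
      assert(before == after)

      // partitionBy shuffles the data so that it is partitioned by the new keys.
      val restored = rekeyed.partitionBy(new HashPartitioner(2))
      assert(restored.partitioner.isDefined)
    } finally {
      sc.stop()
    }
  }
}
```

If only the values change, mapValues (not mentioned in the answer above) leaves the keys untouched and therefore keeps the partitioner, which avoids the extra shuffle.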


answered Sep 26 '22 02:09

Daniel Darabos
