An RDD has a meaningful order (as opposed to some arbitrary order imposed by the storage model) if it was processed by sortBy(), as explained in this reply.
Now, which operations preserve that order?
E.g., is it guaranteed that (after a.sortBy())
a.map(f).zip(a) === a.map(x => (f(x),x))
How about
a.filter(f).map(g) === a.map(x => (x,g(x))).filter(p => f(p._1)).map(_._2)
What about
a.filter(f).flatMap(g) === a.flatMap(x => g(x).map((x,_))).filter(p => f(p._1)).map(_._2)
Here "equality" (===) is understood as "functional equivalence", i.e., there is no way to distinguish the outcome using user-level operations (that is, without reading logs etc.).
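As a sanity check of the three identities themselves, here is a minimal sketch on a local Scala Vector, which always has a defined order. It says nothing about Spark's behavior; it only confirms that the expressions are equivalent on an ordered collection. The names OrderIdentities, f, g and gs are hypothetical (gs plays the role of g in the flatMap case, since it has to return a collection), and the predicate is written as an explicit lambda p => f(p._1).

```scala
// Local-collection check only: a Vector stands in for the sorted RDD `a`.
object OrderIdentities {
  val a: Vector[Int] = Vector(3, 1, 4, 1, 5, 9, 2, 6).sorted

  val f: Int => Boolean = _ % 2 == 1             // hypothetical predicate
  val g: Int => Int = _ * 10                     // hypothetical map function
  val gs: Int => Vector[Int] = x => Vector(x, x * x) // hypothetical flatMap function

  // a.map(f).zip(a) === a.map(x => (f(x), x))
  val first: Boolean =
    a.map(f).zip(a) == a.map(x => (f(x), x))

  // a.filter(f).map(g) === a.map(x => (x, g(x))).filter(p => f(p._1)).map(_._2)
  val second: Boolean =
    a.filter(f).map(g) == a.map(x => (x, g(x))).filter(p => f(p._1)).map(_._2)

  // a.filter(f).flatMap(gs) === a.flatMap(x => gs(x).map((x, _))).filter(p => f(p._1)).map(_._2)
  val third: Boolean =
    a.filter(f).flatMap(gs) ==
      a.flatMap(x => gs(x).map((x, _))).filter(p => f(p._1)).map(_._2)
}
```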
On RDDs with a key (pair RDDs), there are dedicated operations that preserve the order.
RDDs support two types of operations: transformations, which create a new dataset from an existing one, and actions, which return a value to the driver program after running a computation on the dataset. A transformation only describes a new RDD derived from existing ones; the computation on the actual data runs when an action is invoked.
RDDs store data in memory for fast access during computation and provide fault tolerance [110]. An RDD is an immutable distributed collection of records, partitioned across the nodes of the cluster, that can be operated on in parallel.
All operations preserve the order, except those that explicitly do not. Ordering is always "meaningful", not just after a sortBy. For example, if you read a file (sc.textFile) the lines of the RDD will be in the order that they were in the file.
Without trying to give a complete list, map, filter and flatMap do preserve the order. sortBy, partitionBy and join do not preserve the order.
The reason is that most RDD operations work on Iterators inside the partitions. So map or filter just has no way to mess up the order. You can take a look at the code to see for yourself.
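To make the iterator argument concrete, here is a toy model in plain Scala. It is not Spark code: an "RDD" is modeled as an ordered Vector of partitions, and PartitionModel and its members are hypothetical names. Each transformation only walks one partition's iterator from front to back, so neither the order inside a partition nor the order of the partitions can change.

```scala
// Toy model, not Spark: an "RDD" is an ordered list of partitions,
// and each partition is an ordered list of elements.
object PartitionModel {
  type Rdd[T] = Vector[Vector[T]]

  // Apply a function to each partition's iterator, one partition at a time.
  def mapPartitions[T, U](rdd: Rdd[T])(f: Iterator[T] => Iterator[U]): Rdd[U] =
    rdd.map(part => f(part.iterator).toVector)

  // map, filter and flatMap are all expressed through the per-partition iterator.
  def map[T, U](rdd: Rdd[T])(f: T => U): Rdd[U] =
    mapPartitions[T, U](rdd)(_.map(f))

  def filter[T](rdd: Rdd[T])(p: T => Boolean): Rdd[T] =
    mapPartitions[T, T](rdd)(_.filter(p))

  def flatMap[T, U](rdd: Rdd[T])(f: T => Seq[U]): Rdd[U] =
    mapPartitions[T, U](rdd)(_.flatMap(f))

  // "collect" concatenates the partitions in partition order.
  def collect[T](rdd: Rdd[T]): Vector[T] = rdd.flatten

  // Three partitions holding 1..8 in order.
  val a: Rdd[Int] = Vector(Vector(1, 2, 3), Vector(4, 5), Vector(6, 7, 8))

  val mapped: Vector[Int] = collect(map(a)(_ * 10))
  val filtered: Vector[Int] = collect(filter(a)(_ % 2 == 0))
  val expanded: Vector[Int] = collect(flatMap(a)(x => Seq(x, -x)))
}
```

In this model, mapped is 10, 20, ..., 80, filtered is 2, 4, 6, 8, and expanded interleaves each element with its negation, all in the original order.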
You may now ask: What if I have an RDD with a HashPartitioner? What happens when I use map to change the keys? Well, the elements will stay in place, and now the RDD is not partitioned by the key. You can use partitionBy to restore the partitioning with a shuffle.
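The same point can be sketched with a toy model in plain Scala (again not Spark code; KeyPartitionModel, partitionOf and isHashPartitioned are hypothetical names, and the non-negative modulo of hashCode is an assumption standing in for a hash partitioner). Changing keys in place leaves every element in its partition, so the placement no longer matches the hash of the key until the data is routed again.

```scala
// Toy model, not Spark: partitions are Vectors of (key, value) pairs.
object KeyPartitionModel {
  type Rdd[K, V] = Vector[Vector[(K, V)]]

  // Partition index for a key: non-negative modulo of its hash code.
  def partitionOf[K](key: K, numPartitions: Int): Int = {
    val m = key.hashCode % numPartitions
    if (m < 0) m + numPartitions else m
  }

  // Model of a shuffle: route every pair to the partition its key hashes to.
  def partitionBy[K, V](rdd: Rdd[K, V], numPartitions: Int): Rdd[K, V] = {
    val all = rdd.flatten
    Vector.tabulate(numPartitions) { i =>
      all.filter { case (k, _) => partitionOf(k, numPartitions) == i }
    }
  }

  // True when every pair sits in the partition its key hashes to.
  def isHashPartitioned[K, V](rdd: Rdd[K, V]): Boolean =
    rdd.zipWithIndex.forall { case (part, i) =>
      part.forall { case (k, _) => partitionOf(k, rdd.length) == i }
    }

  val pairs: Rdd[Int, String] =
    Vector(Vector(0 -> "a", 1 -> "b", 2 -> "c", 3 -> "d"))

  // Even keys land in partition 0, odd keys in partition 1.
  val partitioned: Rdd[Int, String] = partitionBy(pairs, 2)

  // Changing the keys in place: no element moves between partitions.
  val rekeyed: Rdd[Int, String] =
    partitioned.map(_.map { case (k, v) => (k + 1, v) })

  // A second "shuffle" restores the hash placement.
  val restored: Rdd[Int, String] = partitionBy(rekeyed, 2)
}
```

Here isHashPartitioned holds for partitioned and restored but not for rekeyed. In Spark itself, mapValues is the usual way to transform only the values, which leaves the keys and therefore the partitioning untouched.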