I am trying to figure out all sources of non-determinism in Spark. I understand that non-determinism can come from user-provided functions, e.g. in a map(f) where f involves random numbers. I am instead looking for the operations that can lead to non-determinism, either at the level of transformations/actions or at a lower level, e.g. shuffling.
Off the top of my head:
Operations which require shuffling (or network traffic in general) may output values in a non-deterministic order. This includes obvious cases like groupBy* or join. A less obvious example is the order of ties after sorting.
Operations which depend on changing data sources or on mutable global state.
Side effects executed inside transformations, including accumulator updates.
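The tie-ordering case is easy to miss, so here is a minimal sketch of it. This is a plain-Python model of a shuffle read, not Spark code: `MAP_OUTPUTS`, `fetch`, and the two sort helpers are hypothetical names made up for illustration, and the only assumption is that blocks from different map tasks may arrive at the reducer in any order.

```python
from itertools import permutations

# Hypothetical map-side outputs: three blocks of (key, value) records.
MAP_OUTPUTS = [
    [(1, "a"), (2, "b")],
    [(1, "c"), (3, "d")],
    [(1, "e"), (2, "f")],
]


def fetch(blocks_in_arrival_order):
    """Concatenate blocks in the order they happened to arrive."""
    return [rec for block in blocks_in_arrival_order for rec in block]


def sort_by_key_only(records):
    """Stable sort on the key alone: tied records keep their arrival order."""
    return sorted(records, key=lambda kv: kv[0])


def sort_by_key_and_value(records):
    """Sort on the whole record: no ties remain, so arrival order is irrelevant."""
    return sorted(records)


# Enumerate every possible arrival order instead of relying on real timing.
arrivals = [fetch(p) for p in permutations(MAP_OUTPUTS)]

key_only_results = {tuple(sort_by_key_only(r)) for r in arrivals}
total_order_results = {tuple(sort_by_key_and_value(r)) for r in arrivals}

print(len(key_only_results))     # number of distinct outputs when ties are unresolved
print(len(total_order_results))  # number of distinct outputs with a tie-breaker
```

In this model, sorting on the key alone yields a different result for each arrival order, while sorting on the full record always yields the same one. The analogous precaution in Spark would be to sort on a composite key that is unique per record whenever the relative order of ties matters downstream.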