
Sources of non-determinism in Apache Spark

I am trying to figure out all sources of non-determinism in Spark. I understand that non-determinism can come from user-provided functions, e.g. in a map(f) where f involves randomness. I am instead looking for operations that can lead to non-determinism, either at the level of transformations/actions or at a lower level, e.g. shuffling.

savx2 asked Oct 31 '22 13:10


1 Answer

Off the top of my head:

  • operations which require shuffling (or network traffic in general) may output values in a non-deterministic order. This includes obvious cases like groupBy* or join. A less obvious example is the order of ties after sorting

  • operations which depend on changing data sources or mutable global state

  • side effects executed inside transformations, including accumulator updates
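
To make the first point concrete, here is a rough sketch in plain Python (no Spark needed) of why arrival order matters. Everything in it is hypothetical: the `blocks` dictionary, the executor names and the `fetch` helper only stand in for shuffle blocks arriving in different orders between runs, and this is not how Spark itself is implemented. It shows two effects: a floating-point sum that changes with the order of its inputs, and ties that end up in a different order after sorting by key.

```python
from functools import reduce

# Hypothetical map-side blocks; in a real job these would be shuffle
# blocks fetched from different executors in an unpredictable order.
blocks = {
    "exec-1": [("k", 1e16)],
    "exec-2": [("k", 1.0)],
    "exec-3": [("k", -1e16)],
}

def fetch(order):
    """Concatenate blocks in the given arrival order (stand-in for a shuffle read)."""
    return [rec for name in order for rec in blocks[name]]

def sum_values(records):
    """Left-to-right float sum; float addition is not associative."""
    return reduce(lambda acc, rec: acc + rec[1], records, 0.0)

# Same data, two different arrival orders, two different sums.
run_a = sum_values(fetch(["exec-1", "exec-2", "exec-3"]))
run_b = sum_values(fetch(["exec-1", "exec-3", "exec-2"]))
print(run_a, run_b)

# Ties after sorting: rows with equal keys keep their arrival order,
# so sorting by key alone gives different results per arrival order.
rows_a = [(1, "x"), (1, "y"), (0, "z")]
rows_b = [(1, "y"), (0, "z"), (1, "x")]
sorted_a = sorted(rows_a, key=lambda row: row[0])
sorted_b = sorted(rows_b, key=lambda row: row[0])
print(sorted_a == sorted_b)

# Sorting on the whole row adds a tie-breaker and removes the ambiguity.
stable_a = sorted(rows_a)
stable_b = sorted(rows_b)
print(stable_a == stable_b)
```

The same reasoning suggests the usual remedies: prefer reductions that do not depend on input order, and include enough columns in the sort key to break ties.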

zero323 answered Nov 09 '22 03:11