Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Spark _temporary creation reason

Tags:

apache-spark

Why does spark, while saving result to a file system, uploads result files to a _temporary directory and then move them to output folder instead of directly uploading them to output folder?

like image 523
Shubham Jain Avatar asked Oct 23 '17 05:10

Shubham Jain


1 Answers

Two stage process is the simplest way to ensure consistency of the final result when working with file systems.

You have to remember that each executor thread writes its result set independent of the other threads and writes can be performed at different moments in time or even reuse the same set of resources. At the moment of write Spark cannot determine if all writes will succeed.

  • In case of failure one can rollback the changes by removing temporary directory.
  • In case of success one can commit the changes by moving temporary directory.

Another benefit of this model is clear distinction between writes in progress and finalized output. As a result it can easily integrated with simple workflow management tools, without a need of having a separate state store or other synchronization mechanism.

This model is simple, reliable and works well with file systems for which it has been designed. Unfortunately it doesn't perform that well with object stores, which don't support moves.

like image 98
zero323 Avatar answered Oct 11 '22 17:10

zero323