Spark _temporary creation reason

Question

Why does Spark, when saving results to a file system, upload the result files to a _temporary directory and then move them to the output folder, instead of uploading them directly to the output folder?

zero323 · Accepted Answer

A two-stage process is the simplest way to ensure consistency of the final result when working with file systems.

Remember that each executor thread writes its result set independently of the other threads, and that writes can happen at different moments in time or even reuse the same set of resources. At the moment of a write, Spark cannot determine whether all writes will succeed.

In case of failure, one can roll back the changes by removing the temporary directory.
In case of success, one can commit the changes by moving the temporary directory.
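
The commit/rollback idea above can be sketched on a local file system. This is only an illustration under stated assumptions, not Spark's actual committer: the function name `write_with_commit`, the `fail_on` hook, and the part file names are hypothetical, and the `_SUCCESS` marker simply mirrors the marker file that Hadoop-style committers can write on job completion.

```python
import shutil
import tempfile
from pathlib import Path


def write_with_commit(output_dir, partitions, fail_on=None):
    """Write each partition under _temporary, then commit or roll back.

    `partitions` maps a file name to its text content. `fail_on` is a
    hypothetical hook that simulates a task failing on that file name.
    """
    output = Path(output_dir)
    staging = output / "_temporary"
    staging.mkdir(parents=True, exist_ok=True)
    try:
        for name, content in partitions.items():
            if name == fail_on:
                raise RuntimeError(f"simulated task failure on {name}")
            (staging / name).write_text(content)
    except Exception:
        # Rollback: discard everything written so far.
        shutil.rmtree(staging)
        return False
    # Commit: move finished files into place, then drop the staging directory.
    for part in staging.iterdir():
        part.replace(output / part.name)
    shutil.rmtree(staging)
    # Marker file signalling that the output is complete (assumed convention).
    (output / "_SUCCESS").touch()
    return True


def demo():
    root = Path(tempfile.mkdtemp())
    parts = {"part-00000": "a", "part-00001": "b"}
    ok = write_with_commit(root / "ok", parts)
    failed = write_with_commit(root / "bad", parts, fail_on="part-00001")
    ok_files = sorted(p.name for p in (root / "ok").iterdir())
    bad_files = sorted(p.name for p in (root / "bad").iterdir())
    shutil.rmtree(root)
    return ok, failed, ok_files, bad_files
```

In the failing run, the partially written `part-00000` never reaches the output folder, because it only ever existed under `_temporary`.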

Another benefit of this model is a clear distinction between writes in progress and finalized output. As a result, it can be easily integrated with simple workflow management tools, without the need for a separate state store or other synchronization mechanism.
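
As a rough sketch of how a simple workflow tool might use this distinction, the check below inspects only the file system. The function name `output_state` and the three state labels are hypothetical, and the sketch assumes that a leftover `_temporary` directory means the job has not committed.

```python
import shutil
import tempfile
from pathlib import Path


def output_state(output_dir):
    """Classify an output directory by looking only at the file system."""
    output = Path(output_dir)
    if not output.exists():
        return "missing"
    if (output / "_temporary").exists():
        # Not committed yet: the job is either still running or was abandoned.
        return "in_progress"
    return "finalized"


def demo():
    root = Path(tempfile.mkdtemp())
    job = root / "job"
    states = [output_state(job)]
    (job / "_temporary").mkdir(parents=True)
    states.append(output_state(job))
    (job / "_temporary").rmdir()
    states.append(output_state(job))
    shutil.rmtree(root)
    return states
```

Note that a crashed job can leave `_temporary` behind, so a real tool would likely combine this check with a timeout or a completion marker.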

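This model is simple, reliable, and works well with the file systems for which it was designed. Unfortunately, it doesn't perform as well with object stores, which generally don't support atomic moves, so a move typically has to be emulated by copying the data and deleting the original.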

Tags:

apache-spark

Shubham Jain
