.sparkStaging directory in HDFS is not deleted

Tags:

apache-spark

We are running certain Spark jobs and we see the .sparkStaging directory in HDFS persisting after job completion. Is there a parameter we need to set to delete the staging directory once the job finishes?

spark.yarn.preserve.staging.files is false by default, so we have not set it explicitly. We are running Spark on YARN on a Hortonworks cluster, Spark version 1.2.
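
For reference, a minimal sketch of how the property could be set explicitly when building the SparkConf; the app name and the rest of the configuration here are placeholders, not our actual job:

    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch only: the property already defaults to false, but setting it
    // explicitly makes the intent visible. "MyJob" is a placeholder app name.
    val conf = new SparkConf()
      .setAppName("MyJob")
      .set("spark.yarn.preserve.staging.files", "false") // clean up staging files when the job ends

    val sc = new SparkContext(conf)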

Regards, Manju

asked Mar 30 '15 by armourbear

1 Answer

Please check for the following log events in the console output at job completion to get more insight into what is going on:

  1. ApplicationMaster: Deleting staging directory .sparkStaging/application_xxxxxx_xxxx - the application successfully cleaned up its staging directory
  2. ApplicationMaster: Staging directory is null - the application was not able to find the staging directory for this application
  3. ApplicationMaster: Failed to cleanup staging dir .sparkStaging/application_xxxxxx_xxxx - something went wrong while deleting the staging directory (a sketch for inspecting leftovers by hand follows this list)
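
If you hit case 3, a rough sketch for listing (and, only for applications you are sure have finished, removing) leftover staging directories by hand could look like the following. It assumes the usual default layout of /user/&lt;user&gt;/.sparkStaging under the submitting user's HDFS home directory, which may differ on your cluster:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    // Sketch: list leftover .sparkStaging directories under the submitting
    // user's HDFS home directory (the default location for staging files).
    val fs = FileSystem.get(new Configuration())
    val stagingRoot = new Path(fs.getHomeDirectory, ".sparkStaging")

    if (fs.exists(stagingRoot)) {
      fs.listStatus(stagingRoot).foreach { status =>
        println(s"leftover staging dir: ${status.getPath}")
        // Only delete directories belonging to applications that have finished:
        // fs.delete(status.getPath, true)
      }
    }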

Could you also double-check these settings on the cluster, which can affect the scenario you mentioned: the spark.yarn.preserve.staging.files property and the SPARK_YARN_STAGING_DIR environment variable.
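
To confirm what values the application actually picked up, a quick check along these lines may help (a sketch, assuming an existing SparkContext named sc); note that SPARK_YARN_STAGING_DIR is normally only set inside the YARN containers, so it can legitimately be empty on a client-mode driver:

    // Sketch: print the effective values the running application sees.
    println("spark.yarn.preserve.staging.files = " +
      sc.getConf.getOption("spark.yarn.preserve.staging.files").getOrElse("<not set>"))
    println("SPARK_YARN_STAGING_DIR = " +
      sys.env.getOrElse("SPARK_YARN_STAGING_DIR", "<not set>"))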

answered Sep 29 '22 by Ashrith