 

Change Spark _temporary directory path

Is it possible to change the _temporary directory where Spark saves its temporary files before writing?

In particular, since I am writing single partitions of a table, I would like the temporary folder to be inside the partition folder.

Is it possible?

Asked Apr 09 '19 by Alessandro


1 Answer

There is no way to do this with the default FileOutputCommitter because of its implementation: the FileOutputCommitter creates a ${mapred.output.dir}/_temporary subdirectory where the files are written and, later on, after being committed, moved to ${mapred.output.dir}.

In the end, the entire _temporary folder is deleted. When two or more Spark jobs have the same output directory, mutual deletion of files is inevitable.
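
To make the failure mode concrete, here is a minimal sketch (the path and data are made up) of two jobs sharing one output directory; whichever commits first deletes the shared _temporary folder and can break the other:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("sharedOutputDemo").getOrCreate()
import spark.implicits._

// Both writes stage their task files under /data/table/_temporary.
// When the first job commits, it deletes that folder and takes the
// second job's in-flight output with it.
val out = "/data/table" // illustrative path

new Thread(() => Seq(1, 2).toDF("a").write.mode("append").parquet(out)).start()
Seq(3, 4).toDF("a").write.mode("append").parquet(out)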

Eventually, I copied org.apache.hadoop.mapred.FileOutputCommitter and org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter (you can name the copy YourFileOutputCommitter) and made some changes that allow renaming _temporary.
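
Schematically, the change looks like this (a sketch, not the actual modified source; only the tempPath member comes from the driver snippet below):

import org.apache.hadoop.mapred.FileOutputCommitter

object YourFileOutputCommitter {
  // Pending-directory name, set from the driver before the job runs;
  // "_temporary" is the stock default.
  @volatile var tempPath: String = "_temporary"
}

class YourFileOutputCommitter extends FileOutputCommitter {
  // In the real fix, the Hadoop source is copied wholesale and every
  // hard-coded "_temporary" literal is replaced with
  // YourFileOutputCommitter.tempPath. A plain subclass is not enough,
  // because the stock class builds its paths in private static helpers.
}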

In your driver, you'll have to add the following code:

import org.apache.hadoop.mapred.JobConf

// Clone the Hadoop configuration from the SparkContext and plug in
// the custom committer.
val conf: JobConf = new JobConf(sc.hadoopConfiguration)
conf.setOutputCommitter(classOf[YourFileOutputCommitter])

// Update the temporary path for the committer so this job stages its
// files under its own pending directory instead of _temporary.
YourFileOutputCommitter.tempPath = "_tempJob1"
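
For completeness, a hypothetical way to run a write through that JobConf (the pair RDD, output format, and path are all illustrative assumptions):

import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.Text
import org.apache.hadoop.mapred.{FileOutputFormat, TextOutputFormat}

// Illustrative pair RDD; any (K, V) matching the output format works.
val pairs = sc.parallelize(Seq((new Text("k"), new Text("v"))))

conf.setOutputFormat(classOf[TextOutputFormat[Text, Text]])
FileOutputFormat.setOutputPath(conf, new Path("/data/table/partition=1"))

// saveAsHadoopDataset honours the committer and output format set on conf.
pairs.saveAsHadoopDataset(conf)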

Note: it's better to use MultipleTextOutputFormat to rename files, because two jobs that write to the same location can overwrite each other's output.
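
A minimal sketch of that idea (the class name and the "job1-" prefix are illustrative):

import org.apache.hadoop.io.Text
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

// Prefix each output file with a job-specific tag so that concurrent
// jobs writing to the same directory produce distinct file names.
class PrefixedTextOutputFormat extends MultipleTextOutputFormat[Text, Text] {
  override def generateFileNameForKeyValue(key: Text, value: Text, name: String): String =
    "job1-" + name // `name` is the default part-XXXXX file name
}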

Update

I've written a short post on our tech blog with more details: https://www.outbrain.com/techblog/2020/03/how-you-can-set-many-spark-jobs-write-to-the-same-path/

Answered Nov 06 '22 by Arkadiy Verman