
How to make Spark write a _SUCCESS file for empty parquet output?

Tags:

apache-spark

One of my Spark jobs is currently running over empty input and so produces no output. That's fine for now, but I still need to know that the Spark job ran even if it produced no parquet output.

Is there a way of forcing Spark to write a _SUCCESS file even if there was no output at all? Currently it writes nothing to the directory where the output would go if there were input, so I have no way of telling whether the job failed (this is part of a larger automated pipeline, which keeps rescheduling the job because there's no indication it already ran).

asked Nov 08 '22 by jbrown

1 Answer

The _SUCCESS file is written by Hadoop code. So if your Spark app doesn't generate any output, you can use the Hadoop API to create the _SUCCESS file yourself.
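
For example, from PySpark you can reach the Hadoop FileSystem API through Spark's JVM gateway. A minimal sketch, assuming a live SparkContext named sc and an HDFS output directory (note that _jsc and _jvm are private PySpark internals, so this leans on implementation details):

# Create an empty _SUCCESS marker through the Hadoop FileSystem API.
# Assumes a live SparkContext `sc`; `_jsc`/`_jvm` are PySpark internals.
hadoop_conf = sc._jsc.hadoopConfiguration()
Path = sc._jvm.org.apache.hadoop.fs.Path
success = Path("/path/on/hdfs/_SUCCESS")
fs = success.getFileSystem(hadoop_conf)
fs.create(success).close()  # zero-length file, like `hadoop fs -touchz`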

If you are using PySpark, look into https://github.com/spotify/snakebite, a pure-Python HDFS client.
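
A rough sketch with snakebite, assuming the NameNode is reachable at namenode-host:8020 (both are placeholders) and that its touchz call, which mirrors hadoop fs -touchz, is available:

from snakebite.client import Client

client = Client("namenode-host", 8020)  # hypothetical NameNode host/port
# touchz creates zero-length files; snakebite methods return generators
for result in client.touchz(["/path/on/hdfs/_SUCCESS"]):
    print(result)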

If you are using Scala or Java, use the Hadoop FileSystem API directly.

An alternative would be to ask Spark to write an empty dataset to the output. But this might not be what you need, because there will be a part-00000 file alongside the _SUCCESS file, which downstream consumers might not like.

Here is how to save an empty dataset in PySpark (the equivalent Scala code is almost identical):

$ pyspark
>>> sc.parallelize([], 1).saveAsTextFile("/path/on/hdfs")
>>> exit()

$ hadoop fs -ls /path/on/hdfs
Found 2 items
-rw-r--r--   2 user user          0 2016-02-25 12:54 /path/on/hdfs/_SUCCESS
-rw-r--r--   2 user user          0 2016-02-25 12:54 /path/on/hdfs/part-00000
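
Since the question asks about parquet specifically, here is a hedged variant, assuming Spark 2.x or later with a SparkSession named spark and a known schema for the (empty) result; the schema and path below are placeholders:

from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([StructField("id", StringType(), True)])  # placeholder schema
spark.createDataFrame([], schema).write.mode("overwrite").parquet("/path/on/hdfs")
# The directory will contain _SUCCESS and, depending on the Spark version,
# empty part files that downstream readers must be prepared to handle.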

answered Nov 15 '22 by vvladymyrov