One of my Spark jobs is currently running over empty input and so produces no output. That's fine for now, but I still need to know that the Spark job ran even if it produced no Parquet output.
Is there a way of forcing Spark to write a _SUCCESS file even if there was no output at all? Currently it doesn't write anything to the directory where the output would go, so I have no way of telling whether there was a failure (this is part of a larger automated pipeline, and it keeps rescheduling the job because there's no indication it already ran).
The _SUCCESS file is written by Hadoop code, so if your Spark app doesn't generate any output you can use the Hadoop API to create the _SUCCESS file yourself.
If you are using PySpark - look into https://github.com/spotify/snakebite
If you are using Scala or Java - look into the Hadoop API.
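For example, here is a minimal sketch of creating the marker file from PySpark by going through Spark's JVM gateway to the Hadoop FileSystem API. The path is a placeholder, and sc._jsc / sc._jvm are internal PySpark attributes, so treat this as an illustration rather than a stable API:

# Create an empty _SUCCESS marker via the Hadoop FileSystem API.
# Assumes an active SparkContext `sc`, e.g. inside the pyspark shell.
hadoop_conf = sc._jsc.hadoopConfiguration()
Path = sc._jvm.org.apache.hadoop.fs.Path
fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(hadoop_conf)
out_dir = Path("/path/on/hdfs")               # placeholder output directory
fs.mkdirs(out_dir)                            # make sure the directory exists
fs.createNewFile(Path(out_dir, "_SUCCESS"))   # zero-byte marker file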
An alternative would be to have Spark write an empty dataset to the output. But this might not be what you need, because there will be a part-00000 file alongside the _SUCCESS file, which downstream consumers might not like.
Here is how to save an empty dataset in PySpark (in Scala the code would be essentially the same):
$ pyspark
>>> sc.parallelize([], 1).saveAsTextFile("/path/on/hdfs")
>>> exit()
$ hadoop fs -ls /path/on/hdfs
Found 2 items
-rw-r--r-- 2 user user 0 2016-02-25 12:54 /path/on/hdfs/_SUCCESS
-rw-r--r-- 2 user user 0 2016-02-25 12:54 /path/on/hdfs/part-00000
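Since your job normally produces Parquet, the same idea with an empty DataFrame should also leave a _SUCCESS marker behind. This is only a sketch: the schema below is a made-up placeholder (substitute whatever schema your job actually writes), and sqlContext is the SQLContext the pyspark shell provides:

from pyspark.sql.types import StructType, StructField, StringType

# Placeholder schema - replace it with the schema your job normally writes
schema = StructType([StructField("id", StringType(), True)])

# Writing an empty DataFrame still goes through the Hadoop output committer,
# so the target directory ends up with a _SUCCESS file (plus empty part files)
sqlContext.createDataFrame(sc.emptyRDD(), schema).write.parquet("/path/on/hdfs")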