I am using a Spark Structured Streaming query to write Parquet files to S3 with the following code:
ds.writeStream()
  .format("parquet")
  .outputMode(OutputMode.Append())
  .option("queryName", "myStreamingQuery")
  .option("checkpointLocation", "s3a://my-kafka-offset-bucket-name/")
  .option("path", "s3a://my-data-output-bucket-name/")
  .partitionBy("createdat")
  .start();
I get the desired output in the S3 bucket my-data-output-bucket-name, but along with the output I also get a _spark_metadata folder in it. How can I get rid of it? And if I can't get rid of it, how can I change its location to a different S3 bucket?
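For reference, the output bucket then looks roughly like this (partition values and file names are made up for illustration):

s3a://my-data-output-bucket-name/
  _spark_metadata/                    <-- the folder I want removed or relocated
  createdat=2018-01-01/
    part-00000-....snappy.parquet
  createdat=2018-01-02/
    part-00000-....snappy.parquet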
My understanding is that, as of Spark 2.3, this is not possible. The name of the metadata directory is always _spark_metadata, and the directory is always created at the location the path option points to. I think the only way to "fix" this is to report an issue in Apache Spark's JIRA and hope someone picks it up.
The flow is that DataSource is requested to create the sink of a streaming query and takes the path option. With that, it creates a FileStreamSink. The path option simply becomes the basePath where both the results and the metadata are written.
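To make that coupling concrete, here is a minimal Java sketch of how the metadata location follows from the path option. The constant mirrors FileStreamSink.metadataDir in the Spark source, but the class and method here are illustrative, not Spark's actual API:

import org.apache.hadoop.fs.Path;

// Illustrative only: shows how the metadata path is derived from the
// sink's output path; this is not Spark's own code.
public class MetadataLocation {
  // Spark hard-codes this name (see FileStreamSink.metadataDir in the Spark source)
  static final String METADATA_DIR = "_spark_metadata";

  static Path metadataPath(String basePath) {
    // The log lives directly under the sink's output path, which is why
    // it cannot be redirected to another bucket via configuration.
    return new Path(basePath, METADATA_DIR);
  }

  public static void main(String[] args) {
    System.out.println(metadataPath("s3a://my-data-output-bucket-name/"));
    // prints s3a://my-data-output-bucket-name/_spark_metadata
  }
}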
You may find the initial commit quite useful for understanding the purpose of the metadata directory:
In order to correctly handle partial failures while maintaining exactly once semantics, the files for each batch are written out to a unique directory and then atomically appended to a metadata log. When a parquet based DataSource is initialized for reading, we first check for this log directory and use it instead of file listing when present.
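As a toy illustration of that commit protocol (assuming a local filesystem and made-up names; this is not Spark's actual implementation), the key point is that a batch's file list only becomes visible once its log entry is atomically renamed into place:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;

// Toy sketch of the protocol described in the commit message above.
public class BatchCommitSketch {
  private final Path logDir;

  public BatchCommitSketch(Path logDir) { this.logDir = logDir; }

  // Record the files of a completed batch. The atomic rename is the
  // commit point: a crash before it leaves only a temp file that
  // readers, who trust the log rather than directory listings, never see.
  public void commit(long batchId, List<String> writtenFiles) throws IOException {
    Path tmp = logDir.resolve(batchId + ".tmp");
    Files.write(tmp, writtenFiles);
    Files.move(tmp, logDir.resolve(Long.toString(batchId)),
        StandardCopyOption.ATOMIC_MOVE);
  }
}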