So I'm saving a spark RDD to a S3 bucket using following code. Is there a way to compress(in gz format) and save instead of saving it as a text file.
help_data.repartition(5).saveAsTextFile("s3://help-test/logs/help")
saveAsTextFile
method takes an optional argument which specifies compression codec class:
help_data.repartition(5).saveAsTextFile(
path="s3://help-test/logs/help",
compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec"
)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With