How to save a Spark RDD in gzip format through PySpark

Question

I'm saving a Spark RDD to an S3 bucket using the following code. Is there a way to compress the output (in gz format) when saving, instead of writing it as a plain text file?

help_data.repartition(5).saveAsTextFile("s3://help-test/logs/help")

zero323 · Accepted Answer

The saveAsTextFile method takes an optional compressionCodecClass argument, which specifies the fully qualified class name of the Hadoop compression codec to use:

help_data.repartition(5).saveAsTextFile(
    path="s3://help-test/logs/help",
    compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec"
)
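
With the gzip codec, the output directory should contain one compressed file per partition (named like part-00000.gz) rather than plain-text part files. To spot-check the result without Spark, something like the following minimal sketch could be used. It assumes the part files have already been copied from S3 to a local directory; the helper name read_gzip_parts and that directory are hypothetical and not part of the original answer.

```python
import glob
import gzip
import os


def read_gzip_parts(output_dir):
    """Return all lines from the part-*.gz files in output_dir, in part order."""
    lines = []
    # Sort so that part-00000.gz is read before part-00001.gz, and so on.
    # Marker files such as _SUCCESS do not match the pattern and are skipped.
    for part_path in sorted(glob.glob(os.path.join(output_dir, "part-*.gz"))):
        # Mode "rt" decompresses and decodes to text in one step.
        with gzip.open(part_path, "rt", encoding="utf-8") as part_file:
            lines.extend(line.rstrip("\n") for line in part_file)
    return lines
```

Within Spark itself no extra step should be needed to read the data back: sc.textFile pointed at the same path decompresses .gz files transparently, although each gzip file is read as a single, non-splittable partition.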
