
How to save a Spark RDD in gzip format through PySpark

I'm saving a Spark RDD to an S3 bucket using the following code. Is there a way to compress the output (in gzip format) when saving, instead of writing plain text files?

help_data.repartition(5).saveAsTextFile("s3://help-test/logs/help")
rclakmal asked Dec 10 '15 14:12



1 Answer

The saveAsTextFile method takes an optional argument, compressionCodecClass, that specifies the fully qualified class name of the compression codec to use:

help_data.repartition(5).saveAsTextFile(
    path="s3://help-test/logs/help",
    compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec"
)
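With this codec, each partition is written as its own gzip-compressed part file, and sc.textFile can read such files back without extra options. As a quick sanity check outside Spark, here is a minimal sketch that reads the part files with Python's standard gzip module. It assumes the output was written to (or copied to) a local directory and that the part files are named part-*.gz; the helper name read_gzip_parts and the example path are hypothetical.

```python
import glob
import gzip
import os


def read_gzip_parts(output_dir):
    """Return all lines from the part-*.gz files in output_dir (hypothetical helper)."""
    lines = []
    # Assumption: one gzip part file per partition, named part-00000.gz, part-00001.gz, ...
    # Sorting the names gives a stable, partition-ordered result.
    for part in sorted(glob.glob(os.path.join(output_dir, "part-*.gz"))):
        # "rt" opens the compressed file in text mode so lines decode to str.
        with gzip.open(part, "rt", encoding="utf-8") as handle:
            lines.extend(line.rstrip("\n") for line in handle)
    return lines
```

For example, read_gzip_parts("/tmp/help") would return the saved records as a list of strings, assuming /tmp/help is a local copy of the output directory. Marker files such as _SUCCESS are skipped because they do not match the part-*.gz pattern.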
zero323 answered Sep 28 '22 20:09
