I use Spark 1.6.0 and Scala. I want to save a DataFrame in compressed CSV format.
Here is what I have so far (assume I already have df and sc as SparkContext):
// set the conf to the codec I want
sc.getConf.set("spark.hadoop.mapred.output.compress", "true")
sc.getConf.set("spark.hadoop.mapred.output.compression.codec", "org.apache.hadoop.io.compress.GzipCodec")
sc.getConf.set("spark.hadoop.mapred.output.compression.type", "BLOCK")
df.write
.format("com.databricks.spark.csv")
.save(my_directory)
The output is not in gz format.
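As an aside, sc.getConf returns a copy of the context's configuration, so the .set calls above are silently discarded. On a live context, Hadoop-side output settings are usually applied through sc.hadoopConfiguration instead. A hedged sketch (whether the spark-csv writer honours these keys depends on the output format it uses under the hood):

// Mutates the live Hadoop configuration rather than a discarded SparkConf copy
sc.hadoopConfiguration.set("mapred.output.compress", "true")
sc.hadoopConfiguration.set("mapred.output.compression.codec",
  "org.apache.hadoop.io.compress.GzipCodec")
sc.hadoopConfiguration.set("mapred.output.compression.type", "BLOCK")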
This code works for Spark 2.1, where a dedicated .codec method is not available, so the codec is passed as a write option instead:
df.write
.format("com.databricks.spark.csv")
.option("codec", "org.apache.hadoop.io.compress.GzipCodec")
.save(my_directory)
For Spark 2.2, you can use df.write.csv(..., compression="gzip"), described here: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=codec
With Spark 2.0+, this has become a bit simpler:
df.write.csv("path", compression="gzip") # Python-only
df.write.option("compression", "gzip").csv("path") // Scala or Python
You don't need the external Databricks CSV package anymore.
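For completeness, here is a minimal self-contained Scala sketch of the built-in writer; the app name and output path are placeholders:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("csv-gzip-example")  // placeholder app name
  .getOrCreate()

import spark.implicits._

// A toy DataFrame; in practice, df is whatever you already have.
val df = Seq(("a", 1), ("b", 2)).toDF("key", "value")

// Writes gzip-compressed part files (part-*.csv.gz) under the directory.
df.write
  .option("compression", "gzip")
  .option("header", "true")
  .csv("/tmp/csv_gzip_out")  // placeholder path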
The csv() writer supports a number of handy options. For example:

sep: To set the separator character.
quote: Whether and how to quote values.
header: Whether to include a header line.

There are also a number of other compression codecs you can use, in addition to gzip (a combined sketch follows the list below):
bzip2
lz4
snappy
deflate
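Putting a few of these together, a hedged sketch combining a custom separator, a header line, and one of the alternative codecs (the path is a placeholder):

df.write
  .option("sep", "|")              // custom separator
  .option("header", "true")        // include a header line
  .option("compression", "bzip2")  // any codec from the list above
  .csv("/tmp/csv_bzip2_out")       // placeholder path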
The full Spark docs for the csv() writer are here: Python / Scala
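One practical follow-up: Spark reads the compressed output back transparently, decompressing the .csv.gz part files on the fly (note that a single gzip file is not splittable, so each file is read by one task). A sketch, assuming the placeholder path written above:

val readBack = spark.read
  .option("header", "true")
  .csv("/tmp/csv_gzip_out")  // same placeholder path as above
readBack.show()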