From Spark RDDs, I want to stage and archive JSON data to AWS S3. It only makes sense to compress it, and I have a process working using Hadoop's GzipCodec, but there are a few things that make me nervous about this.
When I look at the type signature of org.apache.spark.rdd.RDD.saveAsTextFile
here:
https://spark.apache.org/docs/2.3.0/api/scala/index.html#org.apache.spark.rdd.RDD
the type signature is:
def saveAsTextFile(path: String, codec: Class[_ <: CompressionCodec]): Unit
but when I check the available compression codecs here:
https://spark.apache.org/docs/2.3.0/api/scala/index.html#org.apache.spark.io.CompressionCodec
the parent trait CompressionCodec and its subtypes all say:
"The wire protocol for a codec is not guaranteed compatible across versions of Spark. This is intended for use as an internal compression utility within a single Spark application."
That's no good... but it's fine, because gzip is probably easier to deal with across ecosystems anyway.
The type signature says the codec must be a subtype of CompressionCodec... but I tried the following to save as .gz, and it works fine, even though Hadoop's GzipCodec is not <: CompressionCodec.
import org.apache.hadoop.io.compress.GzipCodec
rdd.saveAsTextFile(bucketName, classOf[GzipCodec])
My questions:
Well, for starters, are you bound to RDDs or can you use DataSets/DataFrames?
With DataFrames you can use something like:
df.write
  .format("json")
  .option("compression", "org.apache.hadoop.io.compress.GzipCodec")
  .save("...")
However, there are a few considerations. Compression is great, but if the files you're generating are very big, keep in mind that gzip is not a splittable format: if you later want to process such a file, it has to be read by a single worker. For example, if a non-splittable file is 1 GB, it will take T time to process; if it were splittable (like BZip2, or LZO with an index), it could be processed in roughly T/N, where N is the number of splits (assuming 128 MB blocks, that would be about 8).

That's why Hadoop uses SequenceFiles (which are splittable, and use gzip within each block), and why the compressed format of choice when storing to S3 is usually Parquet. Parquet files are typically smaller than gzipped text files and are splittable, so their content can be processed by multiple workers. You could still use gzipped text files, but keep them in the ~100–200 MB range.
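As a rough sketch of the Parquet alternative (the s3a path is again a placeholder; Snappy is Parquet's default codec, and "gzip" is also accepted):

// Write the same DataFrame as Parquet; the files are splittable and compressed internally.
df.write
  .mode("overwrite")
  .option("compression", "snappy")   // or "gzip"
  .parquet("s3a://my-bucket/archive/parquet")   // placeholder path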
At the end of the day, it really depends on what you are planning to do with the data in S3.
Is it going to be queried? In that case Parquet is a much better choice of format.
Is it going to be read/copied to other systems that do not understand Parquet? Then gzip compression is fine. And it's stable; you don't have to worry about it changing. You can try it yourself: save some sample data to S3 and you can still open it with any gzip tool.
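If you'd rather check that from code than from the command line, here is a small sketch using the standard JDK gzip classes on a part file copied locally (the local path is a placeholder):

import java.io.FileInputStream
import java.util.zip.GZIPInputStream
import scala.io.Source

// GzipCodec writes ordinary gzip, so plain java.util.zip can read it.
val in = new GZIPInputStream(new FileInputStream("/tmp/part-00000.gz"))   // placeholder path
Source.fromInputStream(in).getLines().take(5).foreach(println)
in.close()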