Saving compressed JSON from Spark

From Spark RDDs, I want to stage and archive JSON data to AWS S3. It only makes sense to compress it, and I have a process working using Hadoop's GzipCodec, but there are things that make me nervous about this.

When I look at the type signature of org.apache.spark.rdd.RDD.saveAsTextFile here:

https://spark.apache.org/docs/2.3.0/api/scala/index.html#org.apache.spark.rdd.RDD

the type signature is:

def saveAsTextFile(path: String, codec: Class[_ <: CompressionCodec]): Unit

but when I check the available compression codecs here:

https://spark.apache.org/docs/2.3.0/api/scala/index.html#org.apache.spark.io.CompressionCodec

the parent trait CompressionCodec and its subtypes all say:

The wire protocol for a codec is not guaranteed compatible across versions of Spark. This is intended for use as an internal compression utility within a single Spark application.

That's no good... but it's fine, because gzip is probably easier to deal with across ecosystems anyway.

The type signature says the codec must be a subtype of CompressionCodec... but I tried the following to save as .gz, and it works fine, even though Hadoop's GzipCodec is not <: CompressionCodec.

import org.apache.hadoop.io.compress.GzipCodec

// writes gzip-compressed part files (part-00000.gz, ...) under bucketName
rdd.saveAsTextFile(bucketName, classOf[GzipCodec])

My questions:

  • This works, but are there any reasons not to do it this way, or is there a better way?
  • Is this going to be robust across Spark versions (and elsewhere), unlike the built-in compression codecs?
asked Sep 14 '18 by kmh

1 Answer

Well, for starters, are you bound to RDDs, or can you use Datasets/DataFrames?

With DataFrames you can use something like

 df.write.format("json").
    option("compression", "org.apache.hadoop.io.compress.GzipCodec").
    save("...")

However, there are a few considerations. Compression is great, but if the files you're generating are very big, keep in mind that gzip is not a splittable format: if you later want to process such a file, it has to be read by a single worker. If a non-splittable 1 GB file takes time T to process, a splittable file (bzip2, LZO with an index, or Snappy inside a container format) could be processed in roughly T/N, where N is the number of splits (with 128 MB blocks, that would be about 8).

That's why Hadoop uses SequenceFiles (which are splittable and apply compression within each block), and why the compressed format of choice when storing to S3 is usually Parquet. Parquet files are smaller than gzipped text files and are splittable, so their contents can be processed by multiple workers. You can still use gzipped text files, but keep them in the ~100-200 MB range.
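
For illustration only (not part of the original answer; the path and DataFrame name are placeholders), the Parquet route looks roughly like this:

 // Hypothetical sketch: write the same data as snappy-compressed Parquet,
 // which stays splittable and is usually much smaller than gzipped JSON text.
 df.write
   .mode("overwrite")
   .option("compression", "snappy")             // snappy is the usual default for Parquet
   .parquet("s3a://my-bucket/archive/events/")  // placeholder S3 path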

At the end of the day, it really depends on what you are planning to do with the data in S3.

Is it going to be queried? In that case, Parquet is a much better choice of format.

Is it going to be read or copied to other systems that do not understand Parquet? Then gzip compression is fine. And it's stable: you don't have to worry about it changing. You can try it yourself: save some sample data to S3, and you can still open it with any gzip tool.
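
As a quick sanity check (a sketch with placeholder paths and the usual spark session variable), Spark also reads the gzipped output back without any extra options, since the codec is picked from the .gz file extension:

 // Hypothetical check: read the gzip-compressed JSON back; no compression
 // option is needed on the read side, the .gz extension is enough.
 val restored = spark.read.json("s3a://my-bucket/staging/json/part-*.gz")
 restored.show(5)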

answered Oct 07 '22 by Roberto Congiu