
Saving a >>25T SchemaRDD in Parquet format on S3

I have encountered a number of problems when trying to save a very large SchemaRDD in Parquet format on S3. I have already posted specific questions for those problems, but this is what I really need to do. The code should look something like this:

import org.apache.spark._

// build a SQLContext from the existing SparkContext
val sqlContext = new sql.SQLContext(sc)
// infer the schema from a tiny sample of the JSON input (samplingRatio = 10e-6)
val data = sqlContext.jsonFile("s3n://...", 10e-6)
data.saveAsParquetFile("s3n://...")

I run into problems if I have more than about 2000 partitions or if there is a partition larger than 5G. This puts an upper bound on the maximum size of SchemaRDD I can process this way. The practical limit is closer to 1T, since partition sizes vary widely and it only takes one 5G partition for the process to fail.
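
For illustration, here is roughly how one could repartition before writing so that each shard stays well under 5G. This is only a sketch: the total-size figure is an assumed estimate you would supply yourself (e.g. from the S3 console or hadoop fs -du), and repartitioning into thousands of shards is itself the subject of one of the linked questions below.

// sketch only: pick a shard count so each output shard stays well below the 5G S3 limit
// totalSizeBytes is an assumed estimate of the input volume, not something Spark computes here
val totalSizeBytes   = 20L * 1024 * 1024 * 1024 * 1024   // e.g. ~20T of input
val targetShardBytes = 2L * 1024 * 1024 * 1024            // aim for ~2G shards
val numShards        = (totalSizeBytes / targetShardBytes).toInt

// repartition the SchemaRDD before writing; this shuffle is expensive at this scale
data.repartition(numShards).saveAsParquetFile("s3n://...")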

Questions dealing with the specific problems I have encountered are

  • Multipart uploads to Amazon S3 from Apache Spark
  • Error when writing a repartitioned SchemaRDD to Parquet with Spark SQL
  • Spark SQL unable to complete writing Parquet data with a large number of shards

This question is to see whether there are any solutions to the main goal that do not necessarily involve solving one of the above problems directly.


To distill things down, there are two problems:

  • Writing a single shard larger than 5G to S3 fails. AFAIK this is a built-in limit of s3n:// buckets. It should be possible with s3:// buckets, but that does not seem to work from Spark, and hadoop distcp from local HDFS cannot do it either. (See the multipart-upload configuration sketch after this list.)

  • Writing the summary file tends to fail once there are thousands of shards. There seem to be multiple issues with this: writing directly to S3 produces the error in the linked question above, and writing to local HDFS produces an OOM error even on an r3.8xlarge (244G of RAM) once there are about 5000 shards. This seems to be independent of the actual data volume. The summary file seems essential for efficient querying.
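
On the 5G-per-PUT point, the workaround one would normally reach for is multipart upload. A minimal sketch of the Hadoop-side configuration follows; the fs.s3n.multipart.* properties only exist in sufficiently recent Hadoop builds (roughly 2.4+), and whether the Hadoop version bundled with a given Spark 1.1.0 deployment honours them is an assumption, not something verified here.

// sketch only: enable multipart uploads on s3n:// so no single PUT approaches the 5G limit
// (assumes a Hadoop build that supports these s3n multipart properties)
sc.hadoopConfiguration.set("fs.s3n.multipart.uploads.enabled", "true")
// upload in 256M parts
sc.hadoopConfiguration.set("fs.s3n.multipart.uploads.block.size", (256L * 1024 * 1024).toString)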

Taken together, these problems limit Parquet tables on S3 to 25T. In practice the limit is significantly lower, since shard sizes can vary widely within an RDD and the 5G limit applies to the largest shard.

How can I write a >>25T RDD as Parquet to S3?

I am using Spark-1.1.0.

Daniel Mahler asked Oct 13 '14


1 Answer

From the AWS S3 documentation:

The total volume of data and number of objects you can store are unlimited. Individual Amazon S3 objects can range in size from 1 byte to 5 terabytes. The largest object that can be uploaded in a single PUT is 5 gigabytes. For objects larger than 100 megabytes, customers should consider using the Multipart Upload capability.

One way to work around this:

  • Attach an EBS volume to your instance and format it.
  • Copy the files to the "local" EBS volume.
  • Snapshot the volume; the snapshot is stored in S3 automatically.

This also puts less load on your instance.

To access that data, you need to attach the snapshot as an EBS volume to an instance.
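
For completeness, here is a hypothetical sketch of triggering the snapshot step from code with the AWS SDK for Java (v1) used from Scala. The volume id is a placeholder and the SDK dependency is an assumption not present in the original setup; snapshotting from the console or CLI works just as well.

// hypothetical sketch: start an EBS snapshot programmatically via the AWS SDK for Java (v1)
import com.amazonaws.services.ec2.AmazonEC2Client
import com.amazonaws.services.ec2.model.CreateSnapshotRequest

val ec2 = new AmazonEC2Client()   // credentials come from the default provider chain
// "vol-12345678" is a placeholder for the volume holding the Parquet output
val result = ec2.createSnapshot(
  new CreateSnapshotRequest("vol-12345678", "Parquet output snapshot"))
println("snapshot started: " + result.getSnapshot.getSnapshotId)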

Adam Ocsvari answered Sep 21 '22