
Save a large Spark Dataframe as a single json file in S3

I'm trying to save a Spark DataFrame (more than 20 GB) as a single JSON file in Amazon S3. My code to save the dataframe looks like this:

dataframe.repartition(1).save("s3n://mybucket/testfile","json")

But I'm getting an error from S3: "Your proposed upload exceeds the maximum allowed size". I know that the maximum size Amazon allows for a single upload is 5 GB.

Is it possible to use S3 multipart upload with Spark, or is there another way to solve this?

By the way, I need the data in a single file because another user is going to download it afterwards.

I'm using Apache Spark 1.3.1 on a 3-node cluster created with the spark-ec2 script.

Thanks a lot

JG

asked Apr 28 '15 by jegordon


People also ask

Can you store JSON in S3?

Amazon S3 Select works on objects stored in CSV, JSON, or Apache Parquet format. It also works with objects that are compressed with GZIP or BZIP2 (for CSV and JSON objects only), and server-side encrypted objects.
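
For illustration only, here is a minimal boto3 sketch of querying a line-delimited JSON object with S3 Select; the bucket name, key, and SQL expression are placeholders, not anything from the question above.

import boto3

s3 = boto3.client("s3")

# Query a line-delimited JSON object in place with S3 Select.
response = s3.select_object_content(
    Bucket="mybucket",                      # placeholder bucket
    Key="testfile/part-00000.json",         # placeholder key
    ExpressionType="SQL",
    Expression="SELECT * FROM S3Object s LIMIT 10",
    InputSerialization={"JSON": {"Type": "LINES"}},
    OutputSerialization={"JSON": {}},
)

# The response payload is an event stream; print the returned records.
for event in response["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode("utf-8"))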

What is multiline JSON?

The Spark JSON data source API provides the multiline option to read records that span multiple lines. By default, Spark treats each line of a JSON file as one complete record, so the multiline option is needed to process JSON records that span multiple lines.
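
As a minimal PySpark sketch: the multiLine option requires Spark 2.2 or later (newer than the 1.3.1 used in the question), and the path below is a placeholder.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# With multiLine enabled, one JSON record may span several lines;
# by default Spark expects one complete JSON record per line.
df = spark.read.option("multiLine", "true").json("s3a://mybucket/multiline.json")
df.show()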

How do I convert a DataFrame to a JSON string in Scala?

If you still can't figure out a way to convert a DataFrame into JSON, you can use the built-in Spark functions to_json or toJSON. Let me know if you have a sample DataFrame and the JSON format you want to convert to.
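
The snippet above mentions Scala, but the same functions exist in PySpark, which the rest of this page uses. A small sketch with made-up sample data (to_json requires Spark 2.1 or later):

from pyspark.sql import SparkSession
from pyspark.sql.functions import to_json, struct

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

# toJSON returns an RDD of JSON strings, one per row.
print(df.toJSON().take(2))

# to_json builds a JSON string column from a struct of the row's columns.
df.select(to_json(struct(*df.columns)).alias("json")).show(truncate=False)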


1 Answer

I would try splitting the large dataframe into a series of smaller dataframes that you then append to the same target path:

df.write.mode('append').json(yourtargetpath)
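
A slightly fuller sketch of that idea, reusing the dataframe variable from the question and assuming the DataFrameWriter API from Spark 1.4+ (the same API as the one-liner above); the split count and target path are arbitrary placeholders:

target_path = "s3n://mybucket/testfile"   # placeholder target

# Break the large dataframe into roughly equal pieces, then append each one.
pieces = dataframe.randomSplit([1.0] * 10)

for piece in pieces:
    piece.write.mode("append").json(target_path)

Note that Spark's json(path) writes a directory of part files under the given path, so the appended pieces land as several objects under the same prefix rather than as one physical file.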
answered Dec 19 '22 by Jared