
Save a large Spark Dataframe as a single json file in S3

I'm trying to save a Spark DataFrame (more than 20 GB) as a single JSON file in Amazon S3. My code to save the dataframe looks like this:

dataframe.repartition(1).save("s3n://mybucket/testfile","json")

But I'm getting an error from S3: "Your proposed upload exceeds the maximum allowed size". I know that the maximum size Amazon allows for a single upload is 5 GB.

Is it possible to use S3 multipart upload with Spark, or is there another way to solve this?

By the way, I need the data in a single file because another user is going to download it afterwards.

I'm using Apache Spark 1.3.1 on a 3-node cluster created with the spark-ec2 script.

Thanks a lot

JG

asked Apr 28 '15 by jegordon


People also ask

Can you store JSON in S3?

Amazon S3 Select works on objects stored in CSV, JSON, or Apache Parquet format. It also works with objects that are compressed with GZIP or BZIP2 (for CSV and JSON objects only), and server-side encrypted objects.
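
For illustration only, here is a minimal boto3 sketch of querying a line-delimited JSON object with S3 Select; the bucket name, key, and SQL expression are placeholders, not anything from the question above.

import boto3

s3 = boto3.client("s3")

# Query a line-delimited JSON object in place with S3 Select.
response = s3.select_object_content(
    Bucket="mybucket",                      # placeholder bucket
    Key="testfile/part-00000.json",         # placeholder key
    ExpressionType="SQL",
    Expression="SELECT * FROM S3Object s LIMIT 10",
    InputSerialization={"JSON": {"Type": "LINES"}},
    OutputSerialization={"JSON": {}},
)

# The response payload is an event stream; print the returned records.
for event in response["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode("utf-8"))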

What is multiline JSON?

The Spark JSON data source API provides the multiline option to read records that span multiple lines. By default, Spark treats each line of a JSON file as one complete record, so the multiline option is needed to process JSON records that span multiple lines.
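
As a minimal PySpark sketch: the multiLine option requires Spark 2.2 or later (newer than the 1.3.1 used in the question), and the path below is a placeholder.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# With multiLine enabled, one JSON record may span several lines;
# by default Spark expects one complete JSON record per line.
df = spark.read.option("multiLine", "true").json("s3a://mybucket/multiline.json")
df.show()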

How do I convert a DataFrame to a JSON string in Scala?

If you still can't figure out a way to convert a DataFrame into JSON, you can use the built-in Spark functions to_json or toJSON. Let me know if you have a sample DataFrame and the JSON format you want to convert to.
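
The snippet above mentions Scala, but the same functions exist in PySpark, which the rest of this page uses. A small sketch with made-up sample data (to_json requires Spark 2.1 or later):

from pyspark.sql import SparkSession
from pyspark.sql.functions import to_json, struct

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

# toJSON returns an RDD of JSON strings, one per row.
print(df.toJSON().take(2))

# to_json builds a JSON string column from a struct of the row's columns.
df.select(to_json(struct(*df.columns)).alias("json")).show(truncate=False)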


1 Answer

I would try splitting the large dataframe into a series of smaller dataframes that you then append to the same target path:

df.write.mode('append').json(yourtargetpath)
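
A slightly fuller sketch of that idea, reusing the dataframe variable from the question and assuming the DataFrameWriter API from Spark 1.4+ (the same API as the one-liner above); the split count and target path are arbitrary placeholders:

target_path = "s3n://mybucket/testfile"   # placeholder target

# Break the large dataframe into roughly equal pieces, then append each one.
pieces = dataframe.randomSplit([1.0] * 10)

for piece in pieces:
    piece.write.mode("append").json(target_path)

Note that Spark's json(path) writes a directory of part files under the given path, so the appended pieces land as several objects under the same prefix rather than as one physical file.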
answered Dec 19 '22 by Jared