I'm trying to save a Spark DataFrame (more than 20 GB) as a single JSON file in Amazon S3. My code to save the DataFrame looks like this:
dataframe.repartition(1).save("s3n://mybucket/testfile","json")
But I'm getting an error from S3: "Your proposed upload exceeds the maximum allowed size". I know that the maximum size Amazon allows for a single upload is 5 GB.
Is it possible to use S3 multipart upload with Spark, or is there another way to solve this?
By the way, I need the data in a single file because another user is going to download it afterwards.
*I'm using Apache Spark 1.3.1 on a 3-node cluster created with the spark-ec2 script.
Thanks a lot
JG
Amazon S3 Select works on objects stored in CSV, JSON, or Apache Parquet format. It also works with objects that are compressed with GZIP or BZIP2 (for CSV and JSON objects only), and server-side encrypted objects.
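For reference, here is a minimal sketch of querying a JSON Lines object with S3 Select through boto3; the bucket name, key, and SQL expression are illustrative placeholders, not values from the question.

import boto3

s3 = boto3.client("s3")

# S3 Select scans the object server-side and returns only the matching rows.
response = s3.select_object_content(
    Bucket="mybucket",                         # placeholder bucket
    Key="testfile/part-00000.json",            # placeholder key
    ExpressionType="SQL",
    Expression="SELECT * FROM S3Object s LIMIT 10",
    InputSerialization={"JSON": {"Type": "LINES"}},
    OutputSerialization={"JSON": {}},
)

# The result arrives as an event stream; Records events carry the payload bytes.
for event in response["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode("utf-8"))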
The Spark JSON data source API provides a multiline option for reading records that span multiple lines. By default, Spark treats every record in a JSON file as a fully qualified record on a single line; hence, we need to set the multiline option to process JSON that spans multiple lines.
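A minimal sketch of reading multi-line JSON is below; note the multiline option is available in newer Spark releases (2.2+), not in the 1.3.x version used in the question, and the S3 path here is just a placeholder.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-multiline-json").getOrCreate()

# Default behaviour expects one JSON record per line (JSON Lines).
# multiline=true lets Spark parse records that span lines, e.g. a pretty-printed array.
df = spark.read.option("multiline", "true").json("s3a://mybucket/testfile")
df.printSchema()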
If you still can't figure out a way to convert a DataFrame into JSON, you can use the built-in Spark functions to_json or toJSON. Let me know if you have a sample DataFrame and a target JSON format to convert to.
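For example, a sketch of both approaches; the DataFrame and column names below are made up for illustration.

from pyspark.sql import SparkSession
from pyspark.sql.functions import to_json, struct

spark = SparkSession.builder.appName("df-to-json").getOrCreate()

# Illustrative DataFrame; replace with your own data.
df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])

# to_json: stays a DataFrame, with each row serialized into a JSON string column.
json_col_df = df.select(to_json(struct(*df.columns)).alias("value"))
json_col_df.show(truncate=False)

# toJSON: returns an RDD of JSON strings, one per row.
for line in df.toJSON().take(2):
    print(line)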
I would try separating the large DataFrame into a series of smaller DataFrames that you then append to the same target path.
df.write.mode('append').json(yourtargetpath)
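A sketch of that idea, assuming the split weights and target path are placeholders; note that in Spark, each append adds new part files under the target directory rather than growing one physical file.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("chunked-append").getOrCreate()

# Illustrative source; in practice this would be the large DataFrame.
df = spark.range(0, 1000000)

# Split the DataFrame into roughly equal pieces (weights are relative).
pieces = df.randomSplit([1.0] * 8, seed=42)

target_path = "s3a://mybucket/testfile_json"  # placeholder path

for piece in pieces:
    # Each write appends new part files under target_path.
    piece.write.mode("append").json(target_path)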