Writing a Spark dataframe to a .csv file in S3 and choosing a file name in PySpark

I have a dataframe and I am going to write it to a .csv file in S3. I use the following code:

df.coalesce(1).write.csv("dbfs:/mnt/mount1/2016//product_profit_weekly", mode='overwrite', header=True)

It puts a .csv file in the product_profit_weekly folder, but at the moment the .csv file has a weird name in S3. Is it possible to choose a file name when I write it?

asked Oct 28 '16 by chessosapiens


1 Answer

Spark dataframe writers (df.write.___) don't write to a single file; they write one chunk per partition. I imagine what you get is a directory called

dbfs:/mnt/mount1/2016//product_profit_weekly

and one file inside called

part-00000
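
If you really do need a single file with a specific name, a common workaround is to rename that part file after the write. A minimal sketch, assuming a Databricks environment where dbutils is available; the target file name is illustrative:

out_dir = "dbfs:/mnt/mount1/2016//product_profit_weekly"
df.coalesce(1).write.csv(out_dir, mode='overwrite', header=True)

# Locate the lone part file and copy it to the name you want.
part_file = [f.path for f in dbutils.fs.ls(out_dir) if f.name.startswith("part-")][0]
dbutils.fs.cp(part_file, out_dir + "/product_profit_weekly.csv")
dbutils.fs.rm(part_file)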

In this case, you are doing something quite inefficient and not very "sparky": you are coalescing all dataframe partitions into one, meaning your task isn't actually executed in parallel!

Here's a different model. To take advantage of all of Spark's parallelization, DON'T coalesce; write in parallel to some directory.
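
For example, the parallel write is simply the same call without the coalesce (a sketch reusing the asker's path):

# Each partition is written concurrently as its own part file
# inside the target directory.
df.write.csv("dbfs:/mnt/mount1/2016//product_profit_weekly",
             mode='overwrite', header=True)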

If you have 100 partitions, you will get:

part-00000
part-00001
...
part-00099

If you need everything in one flat file, write a little function to merge it after the fact. You could do this either in Scala or in bash with:

cat ${dir}/part-* > $flatFilePath
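
One caveat: with header=True every part file carries its own header row, so a blind cat repeats the header. Below is a hedged Python equivalent that keeps only the first header, assuming the part files are reachable through the local /dbfs FUSE mount (paths illustrative):

import glob

src_dir = "/dbfs/mnt/mount1/2016/product_profit_weekly"    # local view of the dbfs:/ path
flat_path = "/dbfs/mnt/mount1/2016/product_profit_weekly.csv"

with open(flat_path, "w") as flat:
    for i, part in enumerate(sorted(glob.glob(src_dir + "/part-*"))):
        with open(part) as f:
            lines = f.readlines()
            # Keep the header row only from the first part file.
            flat.writelines(lines if i == 0 else lines[1:])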
answered Oct 26 '22 by Tim