Say I have a Spark DataFrame which I want to save as a CSV file. Since Spark 2.0.0, the DataFrameWriter class directly supports saving it as a CSV file.
The default behavior is to save the output in multiple part-*.csv files inside the path provided.
How would I save a DF as a single CSV file instead of multiple part files?
One way to deal with it is to coalesce the DF and then save the file.
df.coalesce(1).write.option("header", "true").csv("sample_file.csv")
However, this has the disadvantage of pulling all the data into a single partition, so one executor needs enough memory to hold the entire DataFrame.
Is it possible to write a single CSV file without using coalesce? If not, is there a more efficient way than the above code?
Spark's DataFrameWriter class provides a csv() method to write a DataFrame to a specified path on disk. It takes the path where you want the output written, and by default it does not write a header row with the column names.

df.write.csv("/tmp/spark_output/datacsv")
df.write.format("csv").save("/tmp/spark_output/datacsv")
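For instance, to include the column names and overwrite any existing output, you can chain writer options (a minimal sketch; the path is just an example):

df.write.option("header", True).mode("overwrite").csv("/tmp/spark_output/datacsv")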
Note: Depending on the number of partitions in the DataFrame, Spark writes the same number of part files to the directory specified as the path. You can check the number of partitions with the snippet below. For more details on partitions, refer to Spark Partitioning. If you want to write a single CSV file, refer to Spark Write Single CSV File.
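For example, in PySpark the partition count is available through the DataFrame's underlying RDD:

print(df.rdd.getNumPartitions())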
By design, when you save an RDD, DataFrame, or Dataset, Spark creates a folder with the name specified in the path and writes the data as multiple part files in parallel (one part file per partition). Each part file has the extension of the format you write (for example .csv, .json, .txt, etc.).
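So writing a DataFrame with two partitions produces a directory that looks roughly like this (the exact part-file names vary by run):

/tmp/spark_output/datacsv/_SUCCESS
/tmp/spark_output/datacsv/part-00000-<uuid>.csv
/tmp/spark_output/datacsv/part-00001-<uuid>.csv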
Just solved this myself using PySpark with dbutils to get the .csv and rename it to the wanted filename.
save_location = "s3a://landing-bucket-test/export/" + year
csv_location = save_location + "temp.folder"
file_location = save_location + "export.csv"

# Write to a temporary folder as a single part file
df.repartition(1).write.csv(path=csv_location, mode="append", header="true")

# Copy the part file to the final name and clean up the temporary folder
file = dbutils.fs.ls(csv_location)[-1].path
dbutils.fs.cp(file, file_location)
dbutils.fs.rm(csv_location, recurse=True)
This answer can be improved by not relying on [-1], although the .csv does seem to always be last in the folder; a more robust selection is sketched below. It is a simple and fast solution if you only work with smaller files and can use repartition(1) or coalesce(1).
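A minimal sketch of that improvement, assuming the same dbutils environment and variables as above: pick the part file by its .csv suffix instead of its position in the listing.

# Select the single part file by extension rather than by position
file = [f.path for f in dbutils.fs.ls(csv_location) if f.path.endswith(".csv")][0]
dbutils.fs.cp(file, file_location)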
Use:

df.toPandas().to_csv("sample_file.csv", header=True)

Note that this collects the entire DataFrame into driver memory, so it only works when the data fits on the driver.
See the documentation for details: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=dataframe#pyspark.sql.DataFrame.toPandas