 

Save content of Spark DataFrame as a single CSV file [duplicate]

Say I have a Spark DataFrame which I want to save as a CSV file. Since Spark 2.0.0, the DataFrameWriter class directly supports saving it as a CSV file.

The default behavior is to save the output in multiple part-*.csv files inside the path provided.

How would I save a DF with:

  1. The path mapping to the exact file name instead of a folder
  2. The header available in the first line
  3. Saved as a single file instead of multiple files

One way to deal with this is to coalesce the DF and then save the file.

df.coalesce(1).write.option("header", "true").csv("sample_file.csv") 

However, this has the disadvantage of pulling all the data onto a single worker, which needs enough memory to hold the whole DataFrame.

Is it possible to write a single CSV file without using coalesce? If not, is there a more efficient way than the code above?
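One workaround that avoids coalesce entirely is to let Spark write its part files in parallel as usual and then merge them outside Spark. A minimal sketch for a local filesystem (the helper name merge_spark_csv_parts is hypothetical, and it assumes every part file was written with the same header row):

```python
import glob
import shutil


def merge_spark_csv_parts(parts_dir, out_file):
    """Concatenate Spark's part-*.csv files into a single CSV.

    Keeps the header row from the first part only, assuming every
    part file starts with the same header line.
    """
    part_paths = sorted(glob.glob(parts_dir + "/part-*.csv"))
    with open(out_file, "w") as out:
        for i, path in enumerate(part_paths):
            with open(path) as part:
                header = part.readline()
                if i == 0:
                    out.write(header)  # keep the header once
                shutil.copyfileobj(part, out)  # copy remaining rows
```

This keeps the write itself fully parallel; only the final merge is single-threaded, and on HDFS the same idea can be done server-side instead of copying through the driver.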

Spandan Brahmbhatt asked Jan 31 '17 21:01


People also ask

How do I write a CSV file from a Dataframe in spark?

Spark's DataFrameWriter class provides a csv() method to save or write a DataFrame at a specified path on disk. This method takes the file path where you want to write, and by default it doesn't write a header or column names.

df.write.csv("/tmp/spark_output/datacsv")
df.write.format("csv").save("/tmp/spark_output/datacsv")

How do I save a dataframewriter in spark?

Since Spark 2.0.0, the DataFrameWriter class directly supports saving a DataFrame as a CSV file. The default behavior is to save the output in multiple part-*.csv files inside the provided path, so saving as a single file requires an extra step.

How to get the partition size of a Dataframe in spark?

Note: Depending on the number of partitions your DataFrame has, Spark writes the same number of part files in the directory specified as the path. You can check the number of partitions with the snippet below. For more details on partitions, refer to Spark Partitioning. If you want to write a single CSV file, refer to Spark Write Single CSV File.

How does spark write to multiple files?

By design, when you save an RDD, DataFrame, or Dataset, Spark creates a folder with the name specified in the path and writes the data as multiple part files in parallel (one part file per partition). Each part file has the extension of the format you write (for example .csv, .json, .txt, etc.).


2 Answers

Just solved this myself using pyspark with dbutils to get the .csv and rename it to the desired filename.

save_location = "s3a://landing-bucket-test/export/" + year
csv_location = save_location + "temp.folder"
file_location = save_location + "export.csv"

df.repartition(1).write.csv(path=csv_location, mode="append", header="true")

file = dbutils.fs.ls(csv_location)[-1].path
dbutils.fs.cp(file, file_location)
dbutils.fs.rm(csv_location, recurse=True)

This answer could be improved by not using [-1], but the .csv always seems to be last in the folder. It's a simple and fast solution if you only work with smaller files and can use repartition(1) or coalesce(1).
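As the answer notes, indexing with [-1] is fragile because it relies on listing order. A more robust alternative is to select the path that actually ends in .csv; a sketch (find_csv_path is a hypothetical helper, written against plain path strings rather than dbutils FileInfo objects):

```python
def find_csv_path(paths):
    """Pick the single part CSV out of a Spark output directory listing.

    Spark also writes bookkeeping files such as _SUCCESS, so filter by
    extension instead of relying on listing order.
    """
    csv_paths = [p for p in paths if p.endswith(".csv")]
    if len(csv_paths) != 1:
        raise ValueError(
            "expected exactly one .csv file, found %d" % len(csv_paths)
        )
    return csv_paths[0]
```

With dbutils this would be applied to something like [f.path for f in dbutils.fs.ls(csv_location)].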

user1217169 answered Oct 02 '22 13:10


Use: df.toPandas().to_csv("sample_file.csv", header=True). Note that toPandas() collects the entire DataFrame into the driver's memory, so this only works when the data fits on the driver.

See documentation for details: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=dataframe#pyspark.sql.DataFrame.toPandas

osbon123 answered Oct 02 '22 13:10