 

Save content of Spark DataFrame as a single CSV file [duplicate]

Say I have a Spark DataFrame which I want to save as a CSV file. Since Spark 2.0.0, the DataFrameWriter class directly supports saving it as a CSV file.

The default behavior is to save the output in multiple part-*.csv files inside the path provided.

How would I save a DF with:

  1. The path mapping to the exact file name instead of a folder
  2. The header available in the first line
  3. Saved as a single file instead of multiple files

One way to deal with this is to coalesce the DF and then save the file.

df.coalesce(1).write.option("header", "true").csv("sample_file.csv") 

However, this has the disadvantage of pulling all the data onto a single worker, which needs enough memory to hold the whole DataFrame.

Is it possible to write a single CSV file without using coalesce? If not, is there a more efficient way than the code above?
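One workaround that avoids coalesce entirely is to let Spark write its part files in parallel as usual and then merge them outside Spark. A minimal sketch for a local filesystem (the helper name merge_spark_csv_parts is hypothetical, and it assumes every part file was written with the same header row):

```python
import glob
import shutil


def merge_spark_csv_parts(parts_dir, out_file):
    """Concatenate Spark's part-*.csv files into a single CSV.

    Keeps the header row from the first part only, assuming every
    part file starts with the same header line.
    """
    part_paths = sorted(glob.glob(parts_dir + "/part-*.csv"))
    with open(out_file, "w") as out:
        for i, path in enumerate(part_paths):
            with open(path) as part:
                header = part.readline()
                if i == 0:
                    out.write(header)  # keep the header once
                shutil.copyfileobj(part, out)  # copy remaining rows
```

This keeps the write itself fully parallel; only the final merge is single-threaded, and on HDFS the same idea can be done server-side instead of copying through the driver.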

Spandan Brahmbhatt asked Jan 31 '17 21:01


People also ask

How do I write a CSV file from a Dataframe in spark?

Spark's DataFrameWriter class provides a csv() method to save or write a DataFrame at a specified path on disk. This method takes the file path where you want to write, and by default it doesn't write a header or column names.

df.write.csv("/tmp/spark_output/datacsv")
df.write.format("csv").save("/tmp/spark_output/datacsv")

How do I save a dataframewriter in spark?

Since Spark 2.0.0, the DataFrameWriter class directly supports saving a DataFrame as a CSV file. The default behavior is to save the output in multiple part-*.csv files inside the provided path, so saving as a single file requires an extra step.

How to get the partition size of a Dataframe in spark?

Note: Depending on the number of partitions your DataFrame has, Spark writes the same number of part files in the directory specified as the path. You can check the number of partitions with the snippet below. For more details on partitions, refer to Spark Partitioning. If you want to write a single CSV file, refer to Spark Write Single CSV File.

How does spark write to multiple files?

By design, when you save an RDD, DataFrame, or Dataset, Spark creates a folder with the name specified in the path and writes the data as multiple part files in parallel (one part file per partition). Each part file has the extension of the format you write (for example .csv, .json, .txt, etc.).


2 Answers

Just solved this myself using pyspark with dbutils to get the .csv and rename it to the desired filename.

save_location = "s3a://landing-bucket-test/export/" + year
csv_location = save_location + "temp.folder"
file_location = save_location + "export.csv"

df.repartition(1).write.csv(path=csv_location, mode="append", header="true")

file = dbutils.fs.ls(csv_location)[-1].path
dbutils.fs.cp(file, file_location)
dbutils.fs.rm(csv_location, recurse=True)

This answer could be improved by not using [-1], but the .csv always seems to be last in the folder. It's a simple and fast solution if you only work with smaller files and can use repartition(1) or coalesce(1).
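As the answer notes, indexing with [-1] is fragile because it relies on listing order. A more robust alternative is to select the path that actually ends in .csv; a sketch (find_csv_path is a hypothetical helper, written against plain path strings rather than dbutils FileInfo objects):

```python
def find_csv_path(paths):
    """Pick the single part CSV out of a Spark output directory listing.

    Spark also writes bookkeeping files such as _SUCCESS, so filter by
    extension instead of relying on listing order.
    """
    csv_paths = [p for p in paths if p.endswith(".csv")]
    if len(csv_paths) != 1:
        raise ValueError(
            "expected exactly one .csv file, found %d" % len(csv_paths)
        )
    return csv_paths[0]
```

With dbutils this would be applied to something like [f.path for f in dbutils.fs.ls(csv_location)].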

user1217169 answered Oct 02 '22 13:10


Use: df.toPandas().to_csv("sample_file.csv", header=True). Note that toPandas() collects the entire DataFrame into the driver's memory, so this only works when the data fits on the driver.

See documentation for details: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=dataframe#pyspark.sql.DataFrame.toPandas

osbon123 answered Oct 02 '22 13:10