Suppose that df is a dataframe in Spark. The way to write df into a single CSV file is

df.coalesce(1).write.option("header", "true").csv("name.csv")

This will write the dataframe into a CSV file contained in a folder called name.csv, but the actual CSV file will be called something like part-00000-af091215-57c0-45c4-a521-cd7d9afb5e54.csv.
I would like to know if it is possible to avoid the folder name.csv and to have the actual CSV file called name.csv rather than part-00000-af091215-57c0-45c4-a521-cd7d9afb5e54.csv. The reason is that I need to write several CSV files which later on I will read together in Python, but my Python code makes use of the actual CSV file names and also needs all the CSV files to sit in a single folder (and not a folder of folders).
Any help is appreciated.
Write a single file using Spark coalesce() or repartition(). This still creates a directory and writes a single part file inside that directory instead of multiple part files. Both coalesce() and repartition() are Spark transformations that bring the data down to a single partition; repartition() performs a full shuffle, while coalesce() merges the existing partitions and avoids one.
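As a minimal sketch of the two approaches (the output paths here are placeholders); either way Spark still produces a directory containing a single part-*.csv file:

# coalesce(1) merges existing partitions without a full shuffle
df.coalesce(1).write.option("header", "true").csv("out_coalesce")
# repartition(1) performs a full shuffle into one partition
df.repartition(1).write.option("header", "true").csv("out_repartition")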
In Spark, you can save (write) a DataFrame to CSV files on disk by using dataframeObj.write.csv("path"); with the same API you can also write the DataFrame to AWS S3, Azure Blob Storage, HDFS, or any other Spark-supported file system.
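For example, a hedged sketch of that API (the bucket, container, and path names are placeholders, and the relevant Hadoop connectors and credentials are assumed to be configured):

# write to an S3 bucket via the s3a connector
df.write.option("header", "true").mode("overwrite").csv("s3a://my-bucket/exports/df_csv")
# write to HDFS
df.write.option("header", "true").csv("hdfs:///user/me/exports/df_csv")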
Spark can create distributed datasets from any storage source supported by Hadoop, including your local file system, HDFS, Cassandra, HBase, Amazon S3, etc. Spark supports text files, SequenceFiles, and any other Hadoop InputFormat.
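A short sketch of reading from a few of those sources (paths are placeholders and spark is an existing SparkSession):

local_df = spark.read.csv("file:///data/local.csv", header=True)
s3_df = spark.read.csv("s3a://my-bucket/data/*.csv", header=True)
lines_rdd = spark.sparkContext.textFile("hdfs:///data/events.txt")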
A possible solution is to convert the Spark dataframe to a pandas dataframe and save it as CSV:
df.toPandas().to_csv("<path>/<filename>")
EDIT: As caujka and snark point out, this only works for small dataframes that fit into the driver's memory. It is useful in real cases where you want to save aggregated data or a sample of the dataframe. Don't use this method for big datasets.
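For example, a small aggregated result fits the pandas route well (the column name "category" and the output file name are illustrative):

summary = df.groupBy("category").count()
summary.toPandas().to_csv("category_counts.csv", index=False)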
If you want to use only the Python standard library, here is an easy function that writes to a single file. You don't have to mess with temp files or go through another directory.
import csv

def spark_to_csv(df, file_path):
    """Converts a Spark dataframe to a single CSV file."""
    with open(file_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=df.columns)
        # Write the header row, then stream the rows through the driver one at a time.
        writer.writeheader()
        for row in df.toLocalIterator():
            writer.writerow(row.asDict())
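Example call (the file name is just illustrative):

spark_to_csv(df, "name.csv")

Because df.toLocalIterator() streams the rows to the driver instead of collecting them all at once, the file is written on the driver's local filesystem without holding the whole dataframe in memory.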