Suppose that df is a dataframe in Spark. The way to write df into a single CSV file is

df.coalesce(1).write.option("header", "true").csv("name.csv")

This will write the dataframe into a CSV file contained in a folder called name.csv, but the actual CSV file will be called something like part-00000-af091215-57c0-45c4-a521-cd7d9afb5e54.csv.
I would like to know if it is possible to avoid the folder name.csv and to have the actual CSV file called name.csv rather than part-00000-af091215-57c0-45c4-a521-cd7d9afb5e54.csv. The reason is that I need to write several CSV files which later on I will read together in Python, but my Python code makes use of the actual CSV file names and also needs all the CSV files to sit in a single folder (and not a folder of folders).
Any help is appreciated.
Write a single file using Spark coalesce() or repartition(). This still creates a directory and writes a single part file inside that directory instead of multiple part files. Both coalesce() and repartition() are Spark transformations that bring the data down to a single partition; repartition() performs a full shuffle, while coalesce() merges the existing partitions and avoids one.
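As a minimal sketch of the two approaches (the output paths here are placeholders); either way Spark still produces a directory containing a single part-*.csv file:

# coalesce(1) merges existing partitions without a full shuffle
df.coalesce(1).write.option("header", "true").csv("out_coalesce")
# repartition(1) performs a full shuffle into one partition
df.repartition(1).write.option("header", "true").csv("out_repartition")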
In Spark, you can save (write) a DataFrame to CSV files on disk by using dataframeObj.write.csv("path"); with the same API you can also write the DataFrame to AWS S3, Azure Blob Storage, HDFS, or any other Spark-supported file system.
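For example, a hedged sketch of that API (the bucket, container, and path names are placeholders, and the relevant Hadoop connectors and credentials are assumed to be configured):

# write to an S3 bucket via the s3a connector
df.write.option("header", "true").mode("overwrite").csv("s3a://my-bucket/exports/df_csv")
# write to HDFS
df.write.option("header", "true").csv("hdfs:///user/me/exports/df_csv")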
Spark can create distributed datasets from any storage source supported by Hadoop, including your local file system, HDFS, Cassandra, HBase, Amazon S3, etc. Spark supports text files, SequenceFiles, and any other Hadoop InputFormat.
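A short sketch of reading from a few of those sources (paths are placeholders and spark is an existing SparkSession):

local_df = spark.read.csv("file:///data/local.csv", header=True)
s3_df = spark.read.csv("s3a://my-bucket/data/*.csv", header=True)
lines_rdd = spark.sparkContext.textFile("hdfs:///data/events.txt")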
A possible solution is to convert the Spark dataframe to a pandas dataframe and save it as CSV:
df.toPandas().to_csv("<path>/<filename>")
EDIT: As caujka and snark point out, this only works for small dataframes that fit into the driver's memory. It is useful in real cases where you want to save aggregated data or a sample of the dataframe. Don't use this method for big datasets.
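For example, a small aggregated result fits the pandas route well (the column name "category" and the output file name are illustrative):

summary = df.groupBy("category").count()
summary.toPandas().to_csv("category_counts.csv", index=False)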
If you want to use only the Python standard library, here is an easy function that writes to a single file. You don't have to mess with temp files or go through another directory.
import csv

def spark_to_csv(df, file_path):
    """Converts a Spark dataframe to a single CSV file."""
    with open(file_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=df.columns)
        # Write the header row, then stream the rows through the driver one at a time.
        writer.writeheader()
        for row in df.toLocalIterator():
            writer.writerow(row.asDict())
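Example call (the file name is just illustrative):

spark_to_csv(df, "name.csv")

Because df.toLocalIterator() streams the rows to the driver instead of collecting them all at once, the file is written on the driver's local filesystem without holding the whole dataframe in memory.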