Spark - How to write a single csv file WITHOUT folder?

Suppose that df is a dataframe in Spark. The way to write df into a single CSV file is

df.coalesce(1).write.option("header", "true").csv("name.csv")

This will write the dataframe into a CSV file contained in a folder called name.csv but the actual CSV file will be called something like part-00000-af091215-57c0-45c4-a521-cd7d9afb5e54.csv.

I would like to know if it is possible to avoid the folder name.csv and to have the actual CSV file called name.csv and not part-00000-af091215-57c0-45c4-a521-cd7d9afb5e54.csv. The reason is that I need to write several CSV files which later on I will read together in Python, but my Python code makes use of the actual CSV names and also needs to have all the single CSV files in a folder (and not a folder of folders).

Any help is appreciated.

asked Apr 27 '17 by antonioACR1
People also ask

How do I write to a single file in PySpark?

Write a single file using Spark's coalesce() or repartition(). This still creates a directory, but it contains a single part file instead of multiple part files. Both coalesce() and repartition() are Spark transformations that collect the data from multiple partitions into a single partition (repartition() performs a full shuffle; coalesce() avoids one).
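For example, a minimal PySpark sketch (the data and output path are made up; assumes a local Spark installation):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

# coalesce(1) merges the existing partitions without a full shuffle;
# repartition(1) would force one. Either way, the result is a directory
# named /tmp/single_csv_dir containing one part-*.csv file.
df.coalesce(1).write.option("header", "true").mode("overwrite").csv("/tmp/single_csv_dir")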

How do I save a Spark file to CSV?

In Spark, you can save (write) a DataFrame to a CSV file on disk using dataframeObj.write.csv("path"). The same API can also write the DataFrame to AWS S3, Azure Blob Storage, HDFS, or any other Spark-supported file system.
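For example (the paths and bucket below are hypothetical; the URI scheme selects the target file system):

# Local disk
df.write.option("header", "true").mode("overwrite").csv("/tmp/out_dir")

# Amazon S3, assuming the hadoop-aws connector and credentials are configured
df.write.option("header", "true").mode("overwrite").csv("s3a://my-bucket/out_dir")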

Can Spark write to local file system?

Spark can create distributed datasets from any storage source supported by Hadoop, including your local file system, HDFS, Cassandra, HBase, Amazon S3, etc. Spark supports text files, SequenceFiles, and any other Hadoop InputFormat.
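A short sketch of reading from the local file system with an explicit file:// URI (the paths are hypothetical; in cluster mode the file must exist at the same path on every worker node):

# One string column named "value", one row per line
text_df = spark.read.text("file:///tmp/example.txt")

# CSV with a header row
csv_df = spark.read.option("header", "true").csv("file:///tmp/example.csv")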


2 Answers

A possible solution is to convert the Spark dataframe to a pandas dataframe and save it as CSV:

df.toPandas().to_csv("<path>/<filename>") 

EDIT: As caujka and snark point out, this only works for small dataframes that fit into the driver's memory. It is fine for real cases where you want to save aggregated data or a sample of the dataframe, but don't use this method for big datasets.
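For big datasets, a common workaround (a sketch, assuming the target path is on the driver's local file system; the paths and helper name are hypothetical) is to let Spark write the single-partition directory and then move the lone part file to the name you want:

import glob
import os
import shutil

def write_single_csv(df, final_path, tmp_dir="/tmp/_spark_csv_tmp"):
    # Spark writes a directory that contains exactly one part file.
    df.coalesce(1).write.option("header", "true").mode("overwrite").csv(tmp_dir)
    # Move that part file to the desired name and remove the directory.
    part_file = glob.glob(os.path.join(tmp_dir, "part-*.csv"))[0]
    shutil.move(part_file, final_path)
    shutil.rmtree(tmp_dir)

write_single_csv(df, "name.csv")

For HDFS or S3 destinations the same idea applies, but the rename has to go through the corresponding file system API instead of shutil.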

answered by Paul Vbl

If you want to use only the Python standard library, this is an easy function that will write to a single file. You don't have to mess with temp files or move the output out of another directory.

import csv

def spark_to_csv(df, file_path):
    """Converts a Spark dataframe to a single CSV file."""
    with open(file_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=df.columns)
        writer.writeheader()  # header row taken from the dataframe's columns
        for row in df.toLocalIterator():
            writer.writerow(row.asDict())
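
Usage is then simply (with df an existing Spark dataframe):

spark_to_csv(df, "name.csv")

Because toLocalIterator() pulls rows one partition at a time, only the largest single partition needs to fit in the driver's memory.
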
answered by smw