Say I have a Spark DataFrame that I want to save to disk as a CSV file. In Spark 2.0.0+, one can get a DataFrameWriter from a DataFrame (Dataset[Row]) and use its .csv method to write the file.
The function is defined as
def csv(path: String): Unit
where path is the location/folder name, not the file name.
Spark stores the output at the specified location by creating CSV files named part-*.csv.
Is there a way to save the CSV with a specified filename instead of part-*.csv? Or is it possible to specify a prefix to use instead of part-r?
Code :
df.coalesce(1).write.csv("sample_path")
Current Output :
sample_path
|
+-- part-r-00000.csv
Desired Output :
sample_path
|
+-- my_file.csv
Note: coalesce is used to output a single file, and the executor has enough memory to hold the whole DataFrame without a memory error.
Use fs.rename(), passing the source and destination paths, to rename a file.
It's not possible to do this directly in Spark's save.
Spark uses the Hadoop file format, which requires data to be partitioned; that's why you get part- files. You can easily change the filename after processing, just like in this question.
In Scala it will look like:
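import org.apache.hadoop.fs._
// Assumes the DataFrame was already written with df.coalesce(1).write.csv("csvDirectory")
val fs = FileSystem.get(sc.hadoopConfiguration)
// Name of the single part file Spark wrote into the output directory
val file = fs.globStatus(new Path("csvDirectory/part*"))(0).getPath().getName()
// Move the part file out of the directory under the desired name
fs.rename(new Path("csvDirectory/" + file), new Path("mydata.csv"))
// Remove the output directory, which is no longer needed
fs.delete(new Path("csvDirectory"), true)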
or just:
import org.apache.hadoop.fs._
val fs = FileSystem.get(sc.hadoopConfiguration)
fs.rename(new Path("csvDirectory/data.csv/part-0000"), new Path("csvDirectory/newData.csv"))
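Putting the write and the rename together, a minimal sketch might look like the following. The object name SingleCsvWriter, the method saveAsSingleCsv and its parameters are hypothetical names chosen for illustration, not part of Spark or Hadoop. It assumes the target file lives on the same filesystem as the temporary directory and outside of it, and that the data fits in a single partition.

```scala
import java.io.IOException

import org.apache.hadoop.fs.Path
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper, shown only as a sketch of the write-then-rename approach.
object SingleCsvWriter {

  // Writes `df` as one CSV file at `targetFile`, using `tmpDir` as scratch space.
  // Assumption: `targetFile` is on the same filesystem as `tmpDir` and not inside it.
  def saveAsSingleCsv(spark: SparkSession, df: DataFrame,
                      tmpDir: String, targetFile: String): Unit = {
    // coalesce(1) gives a single partition, so Spark writes a single part-* file.
    df.coalesce(1).write.mode("overwrite").csv(tmpDir)

    val tmpPath = new Path(tmpDir)
    val fs = tmpPath.getFileSystem(spark.sparkContext.hadoopConfiguration)

    // The exact part file name differs between Spark versions, so glob for it.
    val parts = fs.globStatus(new Path(tmpPath, "part-*"))
    require(parts != null && parts.length == 1,
      s"expected exactly one part file in $tmpDir")

    val target = new Path(targetFile)
    // Remove a stale target first, since rename may fail if it already exists.
    if (fs.exists(target)) fs.delete(target, false)

    // rename reports failure through its Boolean result rather than an exception.
    if (!fs.rename(parts(0).getPath, target))
      throw new IOException(s"could not rename ${parts(0).getPath} to $target")

    // Clean up the temporary directory (including _SUCCESS and other side files).
    fs.delete(tmpPath, true)
  }
}
```

It could then be called as SingleCsvWriter.saveAsSingleCsv(spark, df, "sample_path_tmp", "sample_path/my_file.csv"), where both paths are placeholders.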
Edit: As mentioned in the comments, you can also write your own OutputFormat; please see the documentation for information about this approach to setting the file name.