Specifying the filename when saving a DataFrame as a CSV [duplicate]

Say I have a Spark DataFrame that I want to save to disk as a CSV file. In Spark 2.0.0+, one can get a DataFrameWriter from a DataFrame (Dataset[Row]) via .write and use its .csv method to write the file.

The function is defined as

def csv(path: String): Unit

path: the location/folder name, not the file name.

Spark treats the specified location as a directory and writes the CSV data into it as files named part-*.csv.

Is there a way to save the CSV with a specified filename instead of part-*.csv? Or is it possible to specify a prefix to use instead of part-r?

Code :

df.coalesce(1).write.csv("sample_path")

Current Output :

sample_path
|
+-- part-r-00000.csv

Desired Output :

sample_path
|
+-- my_file.csv

Note: coalesce(1) is used so that a single file is written, and the executor has enough memory to hold the whole DataFrame without a memory error.

892

asked Feb 01 '17 21:02

Spandan Brahmbhatt

1 Answer

It's not possible to do this directly with Spark's save.

Spark writes through the Hadoop file format, which requires the data to be partitioned; that's why you get part- files. You can easily rename the file after the write has finished.

In Scala it will look like:

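import org.apache.hadoop.fs._

// sc is the active SparkContext
val fs = FileSystem.get(sc.hadoopConfiguration)
// Find the name of the single part file inside the output directory
val file = fs.globStatus(new Path("csvDirectory/part*"))(0).getPath().getName()
// Move the part file to the desired file name
fs.rename(new Path("csvDirectory/" + file), new Path("mydata.csv"))
// Remove the output directory, which is no longer needed (recursive delete)
fs.delete(new Path("csvDirectory"), true)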

or, if you already know the name of the part file, just:

import org.apache.hadoop.fs._

val fs = FileSystem.get(sc.hadoopConfiguration)
fs.rename(new Path("csvDirectory/data.csv/part-0000"), new Path("csvDirectory/newData.csv"))
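The write and the rename can be wrapped into one helper. The following is only a sketch under some assumptions: `SingleCsv`, `saveAsSingleCsv`, `tmpDir` and `target` are made-up names for illustration, the DataFrame is small enough for `coalesce(1)`, `tmpDir` does not exist yet (the default save mode fails otherwise), and both paths live on the same file system.

```scala
import java.io.IOException

import org.apache.hadoop.fs.{FileStatus, Path}
import org.apache.spark.sql.DataFrame

object SingleCsv {
  // Hypothetical helper: writes df as one part file into tmpDir,
  // moves that part file to target, then removes tmpDir.
  def saveAsSingleCsv(df: DataFrame, tmpDir: String, target: String): Unit = {
    // Spark still decides the part-* name; we only control the directory.
    df.coalesce(1).write.csv(tmpDir)

    val conf = df.sparkSession.sparkContext.hadoopConfiguration
    val tmpPath = new Path(tmpDir)
    // Resolve the file system from the path so hdfs://, file://, etc. all work.
    val fs = tmpPath.getFileSystem(conf)

    // globStatus may return null, so guard against that.
    val parts = Option(fs.globStatus(new Path(tmpPath, "part-*")))
      .getOrElse(Array.empty[FileStatus])
    require(parts.length == 1,
      s"expected exactly one part file in $tmpDir, found ${parts.length}")

    // rename reports failure through its return value on most file systems.
    if (!fs.rename(parts(0).getPath, new Path(target))) {
      throw new IOException(s"could not rename ${parts(0).getPath} to $target")
    }

    // Drop the leftover directory (including _SUCCESS), recursively.
    fs.delete(tmpPath, true)
  }
}
```

Usage would look like `SingleCsv.saveAsSingleCsv(df, "sample_path_tmp", "sample_path/my_file.csv")`, assuming the `sample_path` directory already exists.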

Edit: As mentioned in the comments, you can also write your own OutputFormat; please see the documentation for details on using this approach to set the file name.
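One way the OutputFormat approach might look is sketched below. It is an illustration, not the method the comments referred to: it goes through the RDD API and Hadoop's `MultipleTextOutputFormat` rather than `DataFrameWriter`, the names `KeyAsFileNameOutputFormat` and `NamedCsv` are made up, and the rows are joined with a plain `mkString(",")`, so there is no quoting or escaping as the real CSV writer would do.

```scala
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
import org.apache.spark.sql.DataFrame

// Hypothetical output format: each record's key becomes the output file name,
// and the key itself is dropped from the written line.
class KeyAsFileNameOutputFormat extends MultipleTextOutputFormat[Any, Any] {
  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
    key.toString

  override def generateActualKey(key: Any, value: Any): Any =
    NullWritable.get()
}

object NamedCsv {
  // Writes df into dir as a single file called fileName.
  // Assumption: values contain no commas, quotes or newlines.
  def save(df: DataFrame, dir: String, fileName: String): Unit = {
    df.rdd
      .coalesce(1) // one partition, so only one task writes fileName
      .map(row => (fileName, row.mkString(",")))
      .saveAsHadoopFile(dir, classOf[String], classOf[String],
        classOf[KeyAsFileNameOutputFormat])
  }
}
```

With more than one partition, several tasks would try to write the same file name, which is why the sketch keeps `coalesce(1)`.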

130

answered Sep 20 '22 17:09

T. Gawęda

Tags: csv, scala, apache-spark, pyspark
