 

How to save a Spark DataFrame as CSV on disk?

For example, the result of this:

df.filter("project = 'en'").select("title","count").groupBy("title").sum() 

returns a DataFrame (collecting it would give an Array[Row]).

How can I save a Spark DataFrame as a CSV file on disk?

asked Oct 16 '15 by Hello lad


2 Answers

Out of the box, older versions of Apache Spark (before 2.x) do not support writing CSV output to disk natively.

You have four available solutions though:

  1. You can convert your DataFrame into an RDD:

    def convertToReadableString(r: Row): String = ???  // serialize one Row to a CSV line
    df.rdd.map(convertToReadableString).saveAsTextFile(filepath)

    This will create a folder at filepath. Under that path, you'll find one file per partition (e.g. part-00000). A minimal implementation sketch appears after this list.

    What I usually do if I want to concatenate all the partitions into one big CSV is

    cat filePath/part* > mycsvfile.csv 

    Some will use coalesce(1, false) to create one partition from the RDD. It's usually a bad practice, since it funnels all of the data through a single task on a single executor, which can easily overwhelm that executor's memory.

    Note that df.rdd will return an RDD[Row].

  2. With Spark < 2.0, you can use the Databricks spark-csv library:

    • Spark 1.4+:

      df.write.format("com.databricks.spark.csv").save(filepath) 
    • Spark 1.3:

      df.save(filepath, "com.databricks.spark.csv") 
  3. With Spark 2.x the spark-csv package is not needed as it's included in Spark.

    df.write.format("csv").save(filepath) 
  4. You can convert the DataFrame to a local pandas DataFrame and use its to_csv method (PySpark only).

Note: Solutions 1, 2 and 3 will result in CSV format files (part-*) generated by the underlying Hadoop API that Spark calls when you invoke save. You will have one part- file per partition.
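
For reference, here is a minimal, self-contained sketch of solution 1 in Scala. The input file, the output path, and the naive comma-join serialization are assumptions for illustration; real data containing commas or quotes would need proper CSV escaping.

    import org.apache.spark.sql.{Row, SparkSession}

    val spark = SparkSession.builder().appName("csv-export").getOrCreate()
    val df = spark.read.json("pageviews.json") // hypothetical input

    // Naive Row-to-CSV serialization: nulls become empty fields,
    // and values are assumed not to contain commas or quotes.
    def convertToReadableString(r: Row): String =
      r.toSeq.map(v => if (v == null) "" else v.toString).mkString(",")

    df.rdd.map(convertToReadableString).saveAsTextFile("/tmp/pageviews_csv")
    // /tmp/pageviews_csv now holds part-00000, part-00001, ... (one per partition),
    // with no header lines, so `cat part-* > mycsvfile.csv` concatenates cleanly.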

answered by eliasah

Writing a DataFrame to disk as CSV is similar to reading one from CSV. If you want your result as a single file, you can use coalesce, but note that coalesce(1) funnels all the data through a single task, so it is only advisable for small results.

df.coalesce(1)
  .write
  .option("header", "true")
  .option("sep", ",")
  .mode("overwrite")
  .csv("output/path")

If your result is an array, you should use a language-specific solution rather than the Spark DataFrame API, because results like that are returned to the driver machine.
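
As a hedged sketch of that idea in Scala (the output path and the header line are hypothetical), collect the small result to the driver and write it with plain JVM I/O:

    import java.io.PrintWriter

    // Only safe when the result comfortably fits in driver memory.
    val rows = df.collect() // Array[Row] materialized on the driver
    val pw = new PrintWriter("result.csv") // hypothetical local path
    try {
      pw.println("title,count") // hypothetical header for the selected columns
      rows.foreach(r => pw.println(r.toSeq.mkString(",")))
    } finally {
      pw.close()
    }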

answered by Erkan Şirin