How do I save a Spark DataFrame as a CSV file on disk?
For example, the result of this:
df.filter("project = 'en'").select("title", "count").groupBy("title").sum()
would return an Array.
Before version 2.x, Apache Spark does not support writing CSV to disk natively.
You have four available solutions though:
You can convert your DataFrame into an RDD:
def convertToReadableString(r: Row) = ???
df.rdd.map { convertToReadableString }.saveAsTextFile(filepath)
This will create a folder at filepath. Under that path you'll find one partition file per partition (e.g. part-000*).
What I usually do if I want to concatenate all the partitions into one big CSV is:
cat filePath/part* > mycsvfile.csv
Some will use coalesce(1, false) to create one partition from the RDD. It's usually a bad practice, since it forces all the data being collected into a single partition on one node, which can overwhelm it.
Note that df.rdd will return an RDD[Row].
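In case it helps, here is one possible (unofficial) implementation of convertToReadableString, assuming you just want each Row's fields joined by commas; it does not handle quoting or escaping of embedded commas or newlines:
import org.apache.spark.sql.Row
// Join every field of the Row with a comma; null fields become the string "null".
def convertToReadableString(r: Row): String = r.mkString(",")
// Writes one text file per partition under filepath.
df.rdd.map(convertToReadableString).saveAsTextFile(filepath)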
With Spark < 2, you can use the Databricks spark-csv library:
Spark 1.4+:
df.write.format("com.databricks.spark.csv").save(filepath)
Spark 1.3:
df.save(filepath,"com.databricks.spark.csv")
With Spark 2.x the spark-csv package is not needed, as it's included in Spark:
df.write.format("csv").save(filepath)
You can convert your DataFrame to a local Pandas DataFrame and then use its to_csv method (PySpark only).
Note: Solutions 1, 2 and 3 will result in CSV-format files (part-*) generated by the underlying Hadoop API that Spark calls when you invoke save. You will have one part- file per partition.
Writing a DataFrame to disk as CSV is similar to reading from CSV. If you want your result as one file, you can use coalesce:
df.coalesce(1)
  .write
  .option("header", "true")
  .option("sep", ",")
  .mode("overwrite")
  .csv("output/path")
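Reading the file back works much the same way (a minimal sketch, assuming spark is your SparkSession and the data was written with a header row as above):
// Read the CSV directory back into a DataFrame, using the header row for column names.
val reloaded = spark.read
  .option("header", "true")
  .option("sep", ",")
  .csv("output/path")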
If your result is an Array, you should use a language-specific solution rather than the Spark DataFrame API, because results like that are returned to the driver machine.
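For example, a minimal sketch of that approach in Scala, assuming the result is small enough to collect() onto the driver (the "title,count" header and the output file name are just illustrations, not part of any API):
import java.io.PrintWriter
val rows = df.collect()                        // Array[Row], pulled onto the driver
val out = new PrintWriter("mycsvfile.csv")     // hypothetical local output path
out.println("title,count")                     // assumed header for this example
rows.foreach(r => out.println(r.mkString(","))) // no quoting/escaping handled here
out.close()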