 

How to save a Spark DataFrame as CSV on disk?

For example, the result of this:

df.filter("project = 'en'").select("title","count").groupBy("title").sum() 

returns a DataFrame (collecting it would give an Array[Row]).

How can I save a Spark DataFrame as a CSV file on disk?

asked Oct 16 '15 by Hello lad


2 Answers

Out of the box, older versions of Apache Spark (before 2.x) do not support writing CSV output to disk natively.

You have four available solutions though:

  1. You can convert your DataFrame into an RDD:

    def convertToReadableString(r: Row): String = ???  // serialize one Row to a CSV line
    df.rdd.map(convertToReadableString).saveAsTextFile(filepath)

    This will create a folder at filepath. Under that path, you'll find one file per partition (e.g. part-00000). A minimal implementation sketch appears after this list.

    What I usually do if I want to concatenate all the partitions into one big CSV is

    cat filePath/part* > mycsvfile.csv 

    Some will use coalesce(1, false) to create one partition from the RDD. It's usually a bad practice, since it funnels all of the data through a single task on a single executor, which can easily overwhelm that executor's memory.

    Note that df.rdd will return an RDD[Row].

  2. With Spark < 2.0, you can use the Databricks spark-csv library:

    • Spark 1.4+:

      df.write.format("com.databricks.spark.csv").save(filepath) 
    • Spark 1.3:

      df.save(filepath, "com.databricks.spark.csv") 
  3. With Spark 2.x the spark-csv package is not needed as it's included in Spark.

    df.write.format("csv").save(filepath) 
  4. You can convert the DataFrame to a local pandas DataFrame and use its to_csv method (PySpark only).

Note: Solutions 1, 2 and 3 will result in CSV format files (part-*) generated by the underlying Hadoop API that Spark calls when you invoke save. You will have one part- file per partition.
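
For reference, here is a minimal, self-contained sketch of solution 1 in Scala. The input file, the output path, and the naive comma-join serialization are assumptions for illustration; real data containing commas or quotes would need proper CSV escaping.

    import org.apache.spark.sql.{Row, SparkSession}

    val spark = SparkSession.builder().appName("csv-export").getOrCreate()
    val df = spark.read.json("pageviews.json") // hypothetical input

    // Naive Row-to-CSV serialization: nulls become empty fields,
    // and values are assumed not to contain commas or quotes.
    def convertToReadableString(r: Row): String =
      r.toSeq.map(v => if (v == null) "" else v.toString).mkString(",")

    df.rdd.map(convertToReadableString).saveAsTextFile("/tmp/pageviews_csv")
    // /tmp/pageviews_csv now holds part-00000, part-00001, ... (one per partition),
    // with no header lines, so `cat part-* > mycsvfile.csv` concatenates cleanly.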

answered by eliasah

Writing a DataFrame to disk as CSV is similar to reading one from CSV. If you want your result as a single file, you can use coalesce, but note that coalesce(1) funnels all the data through a single task, so it is only advisable for small results.

df.coalesce(1)
  .write
  .option("header", "true")
  .option("sep", ",")
  .mode("overwrite")
  .csv("output/path")

If your result is an array, you should use a language-specific solution rather than the Spark DataFrame API, because results like that are returned to the driver machine.
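
As a hedged sketch of that idea in Scala (the output path and the header line are hypothetical), collect the small result to the driver and write it with plain JVM I/O:

    import java.io.PrintWriter

    // Only safe when the result comfortably fits in driver memory.
    val rows = df.collect() // Array[Row] materialized on the driver
    val pw = new PrintWriter("result.csv") // hypothetical local path
    try {
      pw.println("title,count") // hypothetical header for the selected columns
      rows.foreach(r => pw.println(r.toSeq.mkString(",")))
    } finally {
      pw.close()
    }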

answered by Erkan Şirin