
How to export DataFrame to csv in Scala?

How can I export Spark's DataFrame to csv file using Scala?

Tong asked Sep 11 '15


People also ask

How do I convert a DataFrame to CSV in Scala?

The easiest and best way to do this is to use the spark-csv library. You can check the documentation in the provided link, which includes a Scala example of how to load and save data from/to a DataFrame.

How do I save a DataFrame as CSV in Spark Scala?

In Spark, you can save (write) a DataFrame to a CSV file on disk with dataframeObj.write.csv("path"). Using this, you can also write a DataFrame to AWS S3, Azure Blob Storage, HDFS, or any other Spark-supported file system.

How do I create a CSV file in Scala?

For writing a CSV file in plain Scala, use BufferedWriter and FileWriter. Import these classes before choosing a path and giving column headings to the file, then write the header line followed by one comma-separated line per data row.
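As a minimal sketch of the plain-Scala approach just described (no Spark involved) — the file name, column names, and rows below are illustrative:

```scala
import java.io.{BufferedWriter, FileWriter}

object CsvWriterExample {
  def main(args: Array[String]): Unit = {
    val header = Seq("id", "name", "score")
    val rows = Seq(
      Seq("1", "alice", "0.9"),
      Seq("2", "bob", "0.7")
    )
    val writer = new BufferedWriter(new FileWriter("training.csv"))
    try {
      // Write the header, then one comma-separated line per row.
      (header +: rows).foreach { row =>
        writer.write(row.mkString(","))
        writer.newLine()
      }
    } finally {
      writer.close()
    }
  }
}
```

Note this does no quoting or escaping; for fields that may contain commas or newlines, a dedicated CSV library is safer.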


3 Answers

In Spark versions 2+ you can simply use the following:

df.write.csv("/your/location/data.csv")

If you want to make sure that the output is no longer split across multiple part-files, add a .coalesce(1) as follows:

df.coalesce(1).write.csv("/your/location/data.csv")
Taylrl answered Oct 24 '22


The easiest and best way to do this is to use the spark-csv library. You can check the documentation in the provided link, and here is a Scala example of how to save data from a DataFrame.

Code (Spark 1.4+):

dataFrame.write.format("com.databricks.spark.csv").save("myFile.csv")
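The same format string works for loading data back into a DataFrame. A hedged sketch for Spark 1.4+, assuming sqlContext is the SQLContext already in scope (as it is in a spark-shell session):

```scala
// Reading CSV via spark-csv; `sqlContext` is assumed to be in scope.
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")       // treat the first line as column names
  .option("inferSchema", "true")  // infer column types from the data
  .load("myFile.csv")
```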

Edit:

Spark creates part-files while saving the CSV data. If you want to merge the part-files into a single CSV, refer to the following:

Merge Spark's CSV output folder to Single File
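One common way to do that merge, sketched here under the assumption of a Hadoop 2.x classpath (FileUtil.copyMerge was removed in Hadoop 3), with illustrative paths:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

// Merge the part-files Spark wrote under `mydata` into one file.
val conf = new Configuration()
val fs = FileSystem.get(conf)
FileUtil.copyMerge(
  fs, new Path("/your/location/mydata"),     // source directory of part-files
  fs, new Path("/your/location/mydata.csv"), // destination single file
  false,                                     // keep the source directory
  conf, null)
```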

karthik manchala answered Oct 24 '22


The above solution exports the CSV as multiple partitions. I found another solution by zero323 on this Stack Overflow page that exports a DataFrame into one single CSV file when you use coalesce:

df.coalesce(1)
  .write.format("com.databricks.spark.csv")
  .option("header", "true")
  .save("/your/location/mydata")

This creates a directory named mydata, inside which you'll find a CSV file containing the results.

Abu Shoeb answered Oct 24 '22