How to write to CSV in Spark

I'm trying to find an effective way of saving the result of my Spark Job as a csv file. I'm using Spark with Hadoop and so far all my files are saved as part-00000.

Any ideas how to make my spark saving to file with a specified file name?

asked May 07 '14 by Karusmeister


People also ask

How do I write a CSV file in PySpark?

In Spark, you can save (write) a DataFrame to a CSV file on disk by using dataframeObj.write.csv("path"). The same call can also write the DataFrame to AWS S3, Azure Blob Storage, HDFS, or any other Spark-supported file system.


2 Answers

Since Spark uses the Hadoop FileSystem API to write data to files, this is more or less inevitable. If you do

rdd.saveAsTextFile("foo") 

it will be saved as "foo/part-XXXXX", with one part-* file per partition of the RDD you are trying to save. Each partition is written to a separate file for fault tolerance: if the task writing the 3rd partition (i.e. part-00002) fails, Spark simply re-runs the task and overwrites the partially written/corrupted part-00002, without affecting the other parts. If all partitions wrote to the same file, it would be much harder to recover from a single task failure.
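The retry behavior described above can be sketched in plain Python (not Spark code; the directory and file names are hypothetical): each partition owns exactly one part file, so re-running one task only rewrites that one file.

```python
import os
import tempfile

def write_partition(out_dir, index, rows):
    # Each partition writes only its own part file, so a retry
    # overwrites that file without touching the others.
    path = os.path.join(out_dir, f"part-{index:05d}")
    with open(path, "w") as f:
        f.writelines(line + "\n" for line in rows)
    return path

out_dir = tempfile.mkdtemp()
partitions = [["a", "b"], ["c"], ["d", "e"]]

for i, rows in enumerate(partitions):
    write_partition(out_dir, i, rows)

# Simulate a retry of the task for partition 2: only part-00002 changes.
write_partition(out_dir, 2, ["d", "e"])

print(sorted(os.listdir(out_dir)))  # ['part-00000', 'part-00001', 'part-00002']
```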

The part-XXXXX files are usually not a problem if you are going to consume the output again in Spark or other Hadoop-based frameworks: since they all use the HDFS API, asking them to read "foo" makes them read all the part-XXXXX files inside foo as well.
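As a rough plain-Python analogy of that read path (hypothetical directory and contents, not real Spark code), "reading foo" amounts to globbing and concatenating every part file under the directory:

```python
import glob
import os
import tempfile

out_dir = tempfile.mkdtemp()
for i, rows in enumerate([["1,a"], ["2,b"], ["3,c"]]):
    with open(os.path.join(out_dir, f"part-{i:05d}"), "w") as f:
        f.write("\n".join(rows) + "\n")

# Reading the directory means reading every part-* file inside it.
lines = []
for path in sorted(glob.glob(os.path.join(out_dir, "part-*"))):
    with open(path) as f:
        lines.extend(f.read().splitlines())

print(lines)  # ['1,a', '2,b', '3,c']
```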

answered Sep 19 '22 by Tathagata Das


I'd suggest doing it this way (Java example):

theRddToPrint.coalesce(1, true).saveAsTextFile(textFileName);

FileSystem fs = anyUtilClass.getHadoopFileSystem(rootFolder);
FileUtil.copyMerge(
    fs, new Path(textFileName),
    fs, new Path(textFileNameDestiny),
    true, fs.getConf(), null);
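The idea of the merge step can be illustrated with a plain-Python sketch (not the Hadoop API; merge_parts and the file names here are hypothetical): concatenate all part-* files into one file with the desired name, then optionally delete the source directory, roughly what FileUtil.copyMerge does on HDFS.

```python
import glob
import os
import shutil
import tempfile

def merge_parts(src_dir, dest_file, delete_source=True):
    # Concatenate all part-* files into a single named file,
    # mirroring the effect of FileUtil.copyMerge on a local directory.
    with open(dest_file, "w") as out:
        for part in sorted(glob.glob(os.path.join(src_dir, "part-*"))):
            with open(part) as f:
                shutil.copyfileobj(f, out)
    if delete_source:
        shutil.rmtree(src_dir)

src = tempfile.mkdtemp()
for i, text in enumerate(["1,a\n", "2,b\n"]):
    with open(os.path.join(src, f"part-{i:05d}"), "w") as f:
        f.write(text)

dest = os.path.join(tempfile.mkdtemp(), "result.csv")
merge_parts(src, dest)
print(open(dest).read())  # "1,a\n2,b\n"
```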
answered Sep 20 '22 by adoalonso