
Add a header before text file on save in Spark

Tags:

apache-spark

I have some Spark code that processes a CSV file and applies some transformations to it. I now want to save the resulting RDD as a CSV file with a header line. Each line of the RDD is already formatted correctly.

I am not sure how to do this. I wanted to union the header string with my RDD, but the header is a plain string, not an RDD, so that does not work.

poiuytrez asked Oct 02 '14 08:10


1 Answer

You can make an RDD out of your header line and then union it, yes:

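import org.apache.spark.rdd.RDD

val rdd: RDD[String] = ...
// Wrap the header line in a one-element RDD so it can be unioned with the data
val header: RDD[String] = sc.parallelize(Array("my,header,row"))
header.union(rdd).saveAsTextFile(...)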

You then end up with a set of part-xxxxx files that you have to merge.

The problem is that I don't think you're guaranteed that the header will land in the first partition, and therefore end up in part-00000 and at the top of your merged file. In practice, I'm pretty sure it will.
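If you would rather not depend on how union orders its partitions, one alternative is to prepend the header inside partition 0 of the data RDD itself. The following is only a sketch of that different technique, not the union approach above: `prependHeader`, `addHeader`, the sample rows and the output path are hypothetical names chosen for illustration, and it assumes the part files are later merged in name order.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object HeaderInFirstPartition {
  // Emit the header before the rows of partition 0; leave every other partition unchanged.
  def prependHeader(header: String, index: Int, rows: Iterator[String]): Iterator[String] =
    if (index == 0) Iterator(header) ++ rows else rows

  // Apply prependHeader to each partition, passing in the partition index.
  def addHeader(rdd: RDD[String], header: String): RDD[String] =
    rdd.mapPartitionsWithIndex { (index, rows) => prependHeader(header, index, rows) }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("header-demo").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Hypothetical sample data standing in for the already formatted CSV lines.
      val rdd = sc.parallelize(Seq("1,a", "2,b", "3,c"), numSlices = 2)
      // Hypothetical output path; saveAsTextFile fails if the directory already exists.
      addHeader(rdd, "my,header,row").saveAsTextFile("/tmp/with-header")
    } finally {
      sc.stop()
    }
  }
}
```

With this variant the header is part of whatever file partition 0 is written to, so it does not rely on the relative order of two separate RDDs.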

A more reliable option is to merge the part-xxxxx files with Hadoop command-line tools (for example hdfs dfs -getmerge) and, as part of that step, prepend the header line from a separate file.
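The same merge-with-header idea can be sketched in plain Scala for the case where the part files have already been copied to a local directory. This is an illustration under that assumption, not a replacement for the Hadoop tools: `MergeParts`, `mergeWithHeader` and the directory layout are hypothetical, and every part file is assumed to be UTF-8 text.

```scala
import java.io.{File, PrintWriter}
import scala.io.Source

object MergeParts {
  // Write `header` first, then the lines of every part-* file in `dir`, in file-name order.
  def mergeWithHeader(dir: File, header: String, out: File): Unit = {
    // listFiles() returns null if `dir` is not a directory, so guard with Option.
    val parts = Option(dir.listFiles()).getOrElse(Array.empty[File])
      .filter(f => f.isFile && f.getName.startsWith("part-")) // skips _SUCCESS and .crc files
      .sortBy(_.getName)
    val writer = new PrintWriter(out, "UTF-8")
    try {
      writer.print(header + "\n")
      parts.foreach { part =>
        val src = Source.fromFile(part, "UTF-8")
        try src.getLines().foreach(line => writer.print(line + "\n"))
        finally src.close()
      }
    } finally {
      writer.close()
    }
  }
}
```

Because the header is written explicitly before any data, its position in the output does not depend on which partition Spark happened to write first.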

Sean Owen answered Sep 29 '22 07:09