Add a header before text file on save in Spark

Question

I have some Spark code that processes a CSV file and applies some transformations to it. I now want to save the resulting RDD as a CSV file with a header. Each line of this RDD is already formatted correctly.

I am not sure how to do this. I wanted to union the header string with my RDD, but the header string is not an RDD, so that does not work.

Sean Owen · Accepted Answer

You can make an RDD out of your header line and then union it, yes:

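import org.apache.spark.rdd.RDD
// rdd holds the already-formatted CSV lines; sc is the active SparkContext
val rdd: RDD[String] = ...
// Wrap the header line in a one-element RDD so it can be unioned with the data
val header: RDD[String] = sc.parallelize(Array("my,header,row"))
header.union(rdd).saveAsTextFile(...)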

You then end up with a bunch of part-xxxxx files, which you merge into a single file.

The problem is that I don't think you're guaranteed that the header will be the first partition, and therefore that it will land in part-00000 and at the top of your file. In practice, I'm pretty sure it will.

A more reliable approach is to merge the part-xxxxx files with Hadoop commands such as hdfs and, as part of that command, just throw in the header line from a file.
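As a rough illustration of the merge step, here is a minimal sketch that does the same thing from Scala through the Hadoop FileSystem API rather than the hdfs command line mentioned above. The object name MergeWithHeader, the method mergeWithHeader, and its parameters (srcDir, dest, header) are hypothetical names chosen for this example, and it assumes the part files are UTF-8 text that each end with a newline, as saveAsTextFile output normally does.

```scala
import java.nio.charset.StandardCharsets
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileStatus, Path}
import org.apache.hadoop.io.IOUtils

// Hypothetical helper, not part of Spark or Hadoop.
object MergeWithHeader {
  // Writes `header` as the first line of `dest`, then appends every
  // part-* file found in `srcDir`, ordered by file name.
  def mergeWithHeader(conf: Configuration, srcDir: String, dest: String, header: String): Unit = {
    val fs = new Path(srcDir).getFileSystem(conf)
    // globStatus may return null in some cases, so guard against it.
    val parts = Option(fs.globStatus(new Path(srcDir, "part-*")))
      .getOrElse(Array.empty[FileStatus])
      .map(_.getPath)
      .sortBy(_.getName)
    val out = fs.create(new Path(dest)) // overwrites dest if it already exists
    try {
      out.write((header + "\n").getBytes(StandardCharsets.UTF_8))
      parts.foreach { part =>
        val in = fs.open(part)
        // Copy the part file's bytes without closing the shared output stream.
        try IOUtils.copyBytes(in, out, conf, false)
        finally in.close()
      }
    } finally out.close()
  }
}
```

Because the header is written explicitly before any part file is copied, the result does not depend on which partition Spark would have assigned to a header RDD. With this approach you would save the RDD without the union and call the helper on the output directory afterwards.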

Tags:

apache-spark

poiuytrez
