I have some Spark code that processes a CSV file and applies some transformations to it. I now want to save the resulting RDD as a CSV file with a header. Each line of this RDD is already formatted correctly.
I am not sure how to do it. I wanted to union the header string with my RDD, but the header string is not an RDD, so that does not work.
You can make an RDD out of your header line and then union it, yes:
val rdd: RDD[String] = ...
val header: RDD[String] = sc.parallelize(Array("my,header,row"))
header.union(rdd).saveAsTextFile(...)
Then you end up with a bunch of part-xxxxx files that you merge.
The problem is that I don't think you're guaranteed that the header will be the first partition, and therefore that it will end up in part-00000 and at the top of your merged file. In practice, I'm pretty sure it will.
A more reliable approach would be to use Hadoop commands such as hdfs to merge the part-xxxxx files and, as part of that command, throw in the header line from a file.
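As a rough illustration of that merge step, here is a minimal sketch in plain Scala rather than an hdfs command. It assumes the part-xxxxx files have already been copied to a local directory, and the names MergeParts and mergeWithHeader are made up for this example. It writes the header first and then appends each part file in name order, so the header position no longer depends on partition ordering.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path}
import scala.collection.JavaConverters._ // deprecated on Scala 2.13+, but still available

// Hypothetical helper, not part of Spark or Hadoop.
object MergeParts {
  // Writes `header`, then the lines of every part-* file in `partsDir`
  // (sorted by file name), into the single file `out`.
  def mergeWithHeader(partsDir: Path, header: String, out: Path): Unit = {
    // The glob matches whole file names, so _SUCCESS and .crc files are skipped.
    val stream = Files.newDirectoryStream(partsDir, "part-*")
    val parts =
      try stream.asScala.toList.sortBy(_.getFileName.toString)
      finally stream.close()

    val writer = Files.newBufferedWriter(out, StandardCharsets.UTF_8)
    try {
      writer.write(header)
      writer.newLine()
      // readAllLines loads one part file into memory at a time;
      // acceptable for a sketch, not for very large part files.
      for (part <- parts; line <- Files.readAllLines(part, StandardCharsets.UTF_8).asScala) {
        writer.write(line)
        writer.newLine()
      }
    } finally writer.close()
  }
}
```

You would call it with something like MergeParts.mergeWithHeader(dir, "my,header,row", target), where dir is the local directory holding the part files and target is the CSV file to create.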