I have an RDD of the form
org.apache.spark.rdd.RDD[(String, Array[String])]
I want to write this to a CSV file. Please suggest how this can be done.
Calling myrdd.saveAsTextFile directly gives output like the below, because Array[String] has no useful toString:
(875,[Ljava.lang.String;@53620618)
(875,[Ljava.lang.String;@487e3c6c)
You can try:
myrdd.map(a => a._1 + "," + a._2.mkString(",")).saveAsTextFile(outputPath) // outputPath: your destination directory
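For instance, with a hypothetical sample pair matching the shape in the question, the mapped expression produces one CSV line per record:

```scala
val pair = ("875", Array("foo", "bar"))          // hypothetical sample record
val line = pair._1 + "," + pair._2.mkString(",") // same expression as in the map above
// line == "875,foo,bar"
println(line)
```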
The other answer doesn't handle escaping (values that themselves contain commas or quotes). Perhaps this more general solution:
import au.com.bytecode.opencsv.CSVWriter
import java.io.StringWriter
import scala.collection.JavaConversions._ // implicitly converts the Scala List below to the java.util.List that writeAll expects

val toCsv = (a: Array[String]) => {
  val buf = new StringWriter
  val writer = new CSVWriter(buf)
  writer.writeAll(List(a))
  buf.toString.trim
}

rdd.map(t => Array(t._1) ++ t._2)
  .map(a => toCsv(a))
  .saveAsTextFile(dest)
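On more recent Spark versions, another option is to convert the RDD to a DataFrame and let the built-in CSV data source handle quoting and escaping for you. A minimal sketch, assuming Spark 2.x, the myrdd from the question, and a hypothetical output path:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("rdd-to-csv").getOrCreate()
import spark.implicits._ // enables .toDF on RDDs of tuples

// Join each value array into a single field; the CSV writer will quote the
// field automatically since it contains the delimiter. If every array has
// the same length, you could instead map to a tuple of fixed columns.
myrdd.map { case (key, values) => (key, values.mkString(",")) }
  .toDF("key", "values")
  .write
  .csv("/path/to/output") // hypothetical path; Spark writes part files into this directory
```

Like saveAsTextFile, df.write.csv produces a directory of part files rather than a single CSV file.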