
Writing an RDD to a CSV file

I have an RDD of the form

org.apache.spark.rdd.RDD[(String, Array[String])]

I want to write this to a CSV file. How can this be done?

Calling myrdd.saveAsTextFile gives the output below.

(875,[Ljava.lang.String;@53620618)
(875,[Ljava.lang.String;@487e3c6c)
Kundan Kumar asked Feb 03 '15 08:02


2 Answers

You can try:

// outputPath is a placeholder for the destination directory
myrdd.map(a => a._1 + "," + a._2.mkString(",")).saveAsTextFile(outputPath)
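
To try the row formatting without a cluster, here is a minimal sketch that applies the same idea to a local collection. The sample rows and the name toLine are made up for illustration; only the (String, Array[String]) shape comes from the question.

```scala
// Hypothetical sample rows with the same element type as the RDD in the question
val rows: Seq[(String, Array[String])] = Seq(
  ("875", Array("a", "b")),
  ("876", Array("c"))
)

// Prepend the key to its values and join everything with commas.
// Joining explicitly avoids the default Array.toString, which is what
// produced "[Ljava.lang.String;@..." in the question's output.
val toLine: ((String, Array[String])) => String = {
  case (key, values) => (key +: values).mkString(",")
}

val lines = rows.map(toLine)
lines.foreach(println)
```

The same function can be passed to map on the RDD before calling saveAsTextFile. Note that saveAsTextFile writes a directory of part files rather than a single CSV file.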
Szymon answered Nov 12 '22 06:11


The other answer doesn't handle escaping, so a field that contains a comma or a quote will corrupt the row. Perhaps this more general solution, which uses opencsv?

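import au.com.bytecode.opencsv.CSVWriter
import java.io.StringWriter
// JavaConversions is deprecated in newer Scala versions; use explicit .asJava instead
import scala.collection.JavaConverters._

// Render one row as a single CSV line; CSVWriter takes care of quoting and escaping
val toCsv = (a: Array[String]) => {
  val buf = new StringWriter
  val writer = new CSVWriter(buf)
  writer.writeAll(List(a).asJava)
  buf.toString.trim
}

// Prepend the key to its values, convert each row to a CSV line, and write it out
rdd.map(t => Array(t._1) ++ t._2)
   .map(a => toCsv(a))
   .saveAsTextFile(dest)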
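
If pulling in opencsv is not an option, the escaping can be sketched by hand. The helpers below are hypothetical (they are not part of Spark or opencsv) and follow the usual RFC 4180 convention: a field is wrapped in double quotes only when it contains a comma, a quote, or a line break, and embedded quotes are doubled. As far as I know, CSVWriter quotes every field by default, so its output will not be identical to this.

```scala
// Hypothetical helper: quote a field only when it contains a special character
def escapeField(field: String): String =
  if (field.exists(c => c == ',' || c == '"' || c == '\n' || c == '\r'))
    "\"" + field.replace("\"", "\"\"") + "\""
  else field

// Hypothetical helper: build one CSV line from a key and its values
def toCsvLine(key: String, values: Array[String]): String =
  (key +: values).map(escapeField).mkString(",")

val line = toCsvLine("875", Array("plain", "has,comma", "say \"hi\""))
println(line)
```

On the RDD this would be used as rdd.map(t => toCsvLine(t._1, t._2)) before saveAsTextFile. A field with an embedded line break will still span several lines in the output, so line-based readers may need extra care.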
Alister Lee answered Nov 12 '22 07:11