Scala Spark: How to create an RDD from a list of strings and convert it to a DataFrame

I want to create a DataFrame from a list of strings that matches an existing schema. Here is my code.

    val rowValues = List("ann", "f", "90", "world", "23456") // fails
    val rowValueTuple = ("ann", "f", "90", "world", "23456") //works

    val newRow = sqlContext.sparkContext.parallelize(Seq(rowValueTuple)).toDF(df.columns: _*)

    // show() returns Unit, so keep the DataFrame in its own val before displaying it
    val newdf = df.unionAll(newRow)
    newdf.show()

The same code fails if I use the List of String. The difference I see is that rowValueTuple is a Tuple. Since the size of the rowValues list changes dynamically, I cannot manually create a Tuple* object. How can I do this? What am I missing? How can I flatten this list to meet the requirement?

I would appreciate your help.

NehaM asked Apr 21 '16 12:04



1 Answer

A DataFrame has a schema with a fixed number of columns, so building a row from a list of variable length is not a natural fit. Still, you can create your DataFrame from an RDD[Row] using the existing schema, like this:

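    import org.apache.spark.sql.Row
    val rdd = sqlContext.sparkContext.parallelize(Seq(rowValues))
    // Expand each list into a Row with one field per element
    val rowRdd = rdd.map(v => Row(v: _*))
    // Pass rowRdd (an RDD[Row]), not the original rdd of lists
    val newRow = sqlContext.createDataFrame(rowRdd, df.schema)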
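
For anyone who wants to try this end to end, here is a minimal self-contained sketch of the same idea. It makes several assumptions that are not in the question: Spark 2.x or later (so `SparkSession` and `union` stand in for `sqlContext` and `unionAll`), a local master, and a made-up five-column, all-string schema in place of the real `df`. The helper `appendRow` and the column names are hypothetical, chosen only for illustration.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object AppendRowSketch {

  // Hypothetical helper: appends one row, given as a sequence of values,
  // to `df`, reusing df's own schema.
  def appendRow(spark: SparkSession, df: DataFrame, values: Seq[Any]): DataFrame = {
    // A DataFrame has a fixed number of columns, so check the list length up front.
    require(values.length == df.columns.length,
      s"expected ${df.columns.length} values, got ${values.length}")
    // Row.fromSeq builds a Row from a sequence of any length,
    // equivalent to Row(values: _*).
    val rowRdd = spark.sparkContext.parallelize(Seq(Row.fromSeq(values)))
    df.union(spark.createDataFrame(rowRdd, df.schema))
  }

  def main(args: Array[String]): Unit = {
    // Local session, for illustration only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("append-row-sketch")
      .getOrCreate()

    // Assumed stand-in for the existing `df`: five string columns.
    val schema = StructType(
      Seq("name", "gender", "score", "greeting", "zip")
        .map(n => StructField(n, StringType, nullable = true)))
    val df = spark.createDataFrame(
      spark.sparkContext.parallelize(Seq(Row("bob", "m", "75", "hello", "12345"))),
      schema)

    val rowValues = List("ann", "f", "90", "world", "23456")
    appendRow(spark, df, rowValues).show()

    spark.stop()
  }
}
```

Note that the values in the Row have to match the column types in the schema. The sketch gets away with plain strings only because every assumed column is a StringType; if the real `df.schema` has, say, an integer column, the corresponding string would need to be converted before building the Row.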

Vitalii Kotliarenko answered Oct 21 '22 03:10