I want to create a DataFrame from a list of strings that should match an existing schema. Here is my code.
val rowValues = List("ann", "f", "90", "world", "23456") // fails
val rowValueTuple = ("ann", "f", "90", "world", "23456") //works
val newRow = sqlContext.sparkContext.parallelize(Seq(rowValueTuple)).toDF(df.columns: _*)
val newdf = df.unionAll(newRow).show()
The same code fails if I use the List of Strings. I see the difference is that rowValueTuple creates a Tuple. Since the size of the rowValues list changes dynamically, I cannot manually create a Tuple* object. How can I do this? What am I missing? How can I flatten this list to meet the requirement? I'd appreciate your help.
Converting a Spark RDD to a DataFrame can be done with toDF(), with createDataFrame(), or by transforming an RDD[Row] into a DataFrame with an explicit schema.
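As a rough sketch of those three routes (not the asker's code), here is a minimal Spark 1.x example using an SQLContext as in the question; the Person case class, the app name, and the local master are hypothetical and only there to make the snippet self-contained:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical case class used only for illustration.
case class Person(name: String, gender: String)

object RddToDataFrame {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rdd-to-df").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val rdd = sc.parallelize(Seq(Person("ann", "f")))

    // 1) toDF(): schema inferred from the case class via implicits
    val df1 = rdd.toDF()

    // 2) createDataFrame() on an RDD of case classes: schema inferred by reflection
    val df2 = sqlContext.createDataFrame(rdd)

    // 3) RDD[Row] plus an explicit schema
    val schema = StructType(Seq(StructField("name", StringType), StructField("gender", StringType)))
    val rowRdd = rdd.map(p => Row(p.name, p.gender))
    val df3 = sqlContext.createDataFrame(rowRdd, schema)

    df1.show(); df2.show(); df3.show()
    sc.stop()
  }
}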
There are three ways to create an RDD in Spark: parallelizing an existing collection in the driver program, referencing a dataset in an external storage system (e.g. HDFS, HBase, a shared file system), or creating an RDD from an already existing RDD, as in the sketch below.
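For completeness, a minimal sketch of those three creation routes; the HDFS path is hypothetical and the external-storage line is left commented so the snippet runs locally:

import org.apache.spark.{SparkConf, SparkContext}

object RddCreation {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rdd-creation").setMaster("local[*]"))

    // 1) Parallelize an existing collection in the driver program
    val fromCollection = sc.parallelize(Seq("ann", "bob", "carl"))

    // 2) Reference a dataset in external storage (the path below is hypothetical)
    // val fromFile = sc.textFile("hdfs:///data/names.txt")

    // 3) Create an RDD from an already existing RDD via a transformation
    val fromExisting = fromCollection.map(_.toUpperCase)

    println(fromExisting.collect().mkString(", "))
    sc.stop()
  }
}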
A DataFrame has a schema with a fixed number of columns, so it doesn't seem natural to make a row from a list of variable length. Anyway, you can create your DataFrame from an RDD[Row] using the existing schema, like this:
import org.apache.spark.sql.Row

val rdd = sqlContext.sparkContext.parallelize(Seq(rowValues))
val rowRdd = rdd.map(v => Row(v: _*))                      // expand the list into a single Row
val newRow = sqlContext.createDataFrame(rowRdd, df.schema) // build from the RDD[Row], not the raw rdd
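To complete the flow from the question, append the new row and display the result. This assumes every column of df is StringType, since rowValues only holds strings; otherwise the Rows will not match df.schema at runtime.

df.unionAll(newRow).show()   // unionAll is the Spark 1.x API used in the question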