I want to create a DataFrame from a list of strings that should match an existing schema. Here is my code.
val rowValues = List("ann", "f", "90", "world", "23456") // fails
val rowValueTuple = ("ann", "f", "90", "world", "23456") //works
val newRow = sqlContext.sparkContext.parallelize(Seq(rowValueTuple)).toDF(df.columns: _*)
val newdf = df.unionAll(newRow).show()
The same code fails if I use the List of Strings. I see the difference is that rowValueTuple creates a Tuple. Since the size of the rowValues list changes dynamically, I cannot manually create a Tuple* object. How can I do this? What am I missing? How can I flatten this list to meet the requirement? I'd appreciate your help.
Converting a Spark RDD to a DataFrame can be done with toDF(), with createDataFrame(), or by transforming an RDD[Row] into a DataFrame with an explicit schema.
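As a rough sketch of those three routes (not the asker's code), here is a minimal Spark 1.x example using an SQLContext as in the question; the Person case class, the app name, and the local master are hypothetical and only there to make the snippet self-contained:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical case class used only for illustration.
case class Person(name: String, gender: String)

object RddToDataFrame {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rdd-to-df").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val rdd = sc.parallelize(Seq(Person("ann", "f")))

    // 1) toDF(): schema inferred from the case class via implicits
    val df1 = rdd.toDF()

    // 2) createDataFrame() on an RDD of case classes: schema inferred by reflection
    val df2 = sqlContext.createDataFrame(rdd)

    // 3) RDD[Row] plus an explicit schema
    val schema = StructType(Seq(StructField("name", StringType), StructField("gender", StringType)))
    val rowRdd = rdd.map(p => Row(p.name, p.gender))
    val df3 = sqlContext.createDataFrame(rowRdd, schema)

    df1.show(); df2.show(); df3.show()
    sc.stop()
  }
}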
There are three ways to create an RDD in Spark: parallelizing an existing collection in the driver program, referencing a dataset in an external storage system (e.g. HDFS, HBase, a shared file system), or creating an RDD from an already existing RDD, as in the sketch below.
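For completeness, a minimal sketch of those three creation routes; the HDFS path is hypothetical and the external-storage line is left commented so the snippet runs locally:

import org.apache.spark.{SparkConf, SparkContext}

object RddCreation {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rdd-creation").setMaster("local[*]"))

    // 1) Parallelize an existing collection in the driver program
    val fromCollection = sc.parallelize(Seq("ann", "bob", "carl"))

    // 2) Reference a dataset in external storage (the path below is hypothetical)
    // val fromFile = sc.textFile("hdfs:///data/names.txt")

    // 3) Create an RDD from an already existing RDD via a transformation
    val fromExisting = fromCollection.map(_.toUpperCase)

    println(fromExisting.collect().mkString(", "))
    sc.stop()
  }
}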
A DataFrame has a schema with a fixed number of columns, so it doesn't seem natural to make a row from a list of variable length. Anyway, you can create your DataFrame from an RDD[Row] using the existing schema, like this:
import org.apache.spark.sql.Row

val rdd = sqlContext.sparkContext.parallelize(Seq(rowValues))
val rowRdd = rdd.map(v => Row(v: _*))                      // expand the list into a single Row
val newRow = sqlContext.createDataFrame(rowRdd, df.schema) // build from the RDD[Row], not the raw rdd
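To complete the flow from the question, append the new row and display the result. This assumes every column of df is StringType, since rowValues only holds strings; otherwise the Rows will not match df.schema at runtime.

df.unionAll(newRow).show()   // unionAll is the Spark 1.x API used in the question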