How to add a column to a Dataset without converting to a DataFrame, and how to access it?

Tags:

apache-spark

I am aware of the method of adding a new column to a Spark Dataset using .withColumn() and a UDF, which returns a DataFrame. I am also aware that we can convert the resulting DataFrame back to a Dataset.

My questions are:

How does the Dataset's type safety come into play here, if we are still following the traditional DataFrame approach (i.e. passing column names as strings for the UDF's input)?
Is there an "object-oriented way" of accessing columns (without passing column names as strings), as we used to do with RDDs, for appending a new column?
How do I access the new column in normal operations like map, filter, etc.?

For example:

    scala> case class Temp(a: Int, b: String)    // case class describing columns a and b
    scala> val df = Seq(Temp(1,"1str"),Temp(2,"2str"),Temp(3,"3str")).toDS    // creating a Dataset[Temp]
    scala> val appendUDF = udf( (b: String) => b + "ing")      // sample UDF

    scala> df.withColumn("c", appendUDF(df("b")))   // adding a new column via the UDF
    res5: org.apache.spark.sql.DataFrame = [a: int, b: string ... 1 more field]

    scala> res5.as[Temp]   // converting to DS
    res6: org.apache.spark.sql.Dataset[Temp] = [a: int, b: string ... 1 more field]

    scala> res6.map( x =>x.  
    // list of autosuggestion :
    a   canEqual   equals     productArity     productIterator   toString   
    b   copy       hashCode   productElement   productPrefix

The new column c that I added using .withColumn() is not accessible, because column c is not in the case class Temp (which contains only a and b) at the moment the DataFrame is converted to a Dataset using res5.as[Temp].

How to access column c?

asked Nov 15 '16 11:11

vdep

1 Answer

In the type-safe world of Datasets, you map one structure into another.

That is, for each transformation we need a schema representation of the data (just as we do for RDDs). To access 'c' above, we need to create a new schema, i.e. a new case class, that provides access to it.

    case class A(a: String)
    case class BC(b: String, c: String)
    val f: A => BC = a => BC(a.a, "c") // Transforms an A into a BC

    val data = (1 to 10).map(i => A(i.toString))
    val dsa = spark.createDataset(data)
    // dsa: org.apache.spark.sql.Dataset[A] = [a: string]

    val dsb = dsa.map(f)
    // dsb: org.apache.spark.sql.Dataset[BC] = [b: string, c: string]
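
Applied to the Temp example from the question, the same idea might look like the sketch below. It is only an illustration under a few assumptions: the names TempWithC, addC and AddColumnExample are hypothetical, and a local SparkSession is created so the snippet can run outside spark-shell. It shows two routes to a Dataset whose type exposes c: mapping each Temp to a wider case class, and keeping .withColumn() but converting with .as[...] to a case class that actually declares c.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.functions.{col, concat, lit}

// Case classes live at top level so Spark can derive encoders for them.
case class Temp(a: Int, b: String)
// Hypothetical wider schema that also declares the new column c.
case class TempWithC(a: Int, b: String, c: String)

object AddColumnExample {
  // Plain Scala function from the old structure to the new one (no column-name strings).
  val addC: Temp => TempWithC = t => TempWithC(t.a, t.b, t.b + "ing")

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, only for running this sketch standalone.
    val spark = SparkSession.builder().master("local[*]").appName("add-column").getOrCreate()
    import spark.implicits._

    val ds: Dataset[Temp] = Seq(Temp(1, "1str"), Temp(2, "2str"), Temp(3, "3str")).toDS()

    // Typed route: map every Temp to a TempWithC.
    val typed: Dataset[TempWithC] = ds.map(addC)

    // Untyped route: withColumn returns a DataFrame; converting it with a case class
    // that contains c makes the column available as a field again.
    val viaDf: Dataset[TempWithC] =
      ds.withColumn("c", concat(col("b"), lit("ing"))).as[TempWithC]

    // c is now an ordinary field in typed operations such as filter and map.
    val cValues = typed
      .filter((t: TempWithC) => t.c.startsWith("2"))
      .map((t: TempWithC) => t.c)
      .collect()
    println(cValues.mkString(","))

    viaDf.show()
    spark.stop()
  }
}
```

Either way, the compiler only knows about c once some case class declares it; the typed route simply avoids referring to columns by name.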

answered Oct 09 '22 05:10

maasg
