Scala: add a new column to a DataFrame by expression

Tags:

dataframe

scala

apache-spark

I want to add a new column to a DataFrame using an expression. For example, I have this DataFrame:

+-----+----------+----------+-----+
| C1  | C2       |   C3     |C4   |
+-----+----------+----------+-----+
|steak|1         |1         |  150|
|steak|2         |2         |  180|
| fish|3         |3         |  100|
+-----+----------+----------+-----+

I want to create a new column C5 from the expression "C2/C3+C4". Several new columns may need to be added, and their expressions may differ and come from a database.

Is there a good way to do this?

I know that if I have an expression like "2+3*4", I can use scala.tools.reflect.ToolBox to evaluate it.

Normally I use df.withColumn to add a new column.

It seems I need to create a UDF, but how can I pass the column values as parameters to the UDF? This matters especially because there may be multiple expressions, each needing different columns.

asked Sep 07 '17 03:09

Robin Wang

2 Answers

This can be done using expr to create a Column from an expression:

    import org.apache.spark.sql.functions.expr

    val df = Seq((1, 2)).toDF("x", "y")
    val myExpression = "x+y"

    df.withColumn("z", expr(myExpression)).show()

+---+---+---+
|  x|  y|  z|
+---+---+---+
|  1|  2|  3|
+---+---+---+

answered Sep 28 '22 11:09

Raphael Roth

Two approaches:

    import spark.implicits._ //so that you could use .toDF
    val df = Seq(
      ("steak", 1, 1, 150),
      ("steak", 2, 2, 180),
      ("fish", 3, 3, 100)
    ).toDF("C1", "C2", "C3", "C4")

    import org.apache.spark.sql.functions._

    // 1st approach using expr
    // (note: this computes C2/(C3 + C4); the question's "C2/C3+C4" parses as (C2/C3) + C4)
    df.withColumn("C5", expr("C2/(C3 + C4)")).show()

    // 2nd approach using selectExpr
    df.selectExpr("*", "(C2/(C3 + C4)) as C5").show()

+-----+---+---+---+--------------------+
|   C1| C2| C3| C4|                  C5|
+-----+---+---+---+--------------------+
|steak|  1|  1|150|0.006622516556291391|
|steak|  2|  2|180| 0.01098901098901099|
| fish|  3|  3|100| 0.02912621359223301|
+-----+---+---+---+--------------------+
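Since the question mentions several new columns whose expressions come from a database, the same idea can probably be extended by folding over a list of (column name, expression) pairs. The following is only a minimal sketch: the `expressions` sequence is a hypothetical stand-in for whatever the database query returns, the column names C5 and C6 are illustrative, and a local SparkSession is assumed.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.expr

// Assumption: a local session; in spark-shell, `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("expr-columns")
  .getOrCreate()
import spark.implicits._ // needed for .toDF

val df = Seq(
  ("steak", 1, 1, 150),
  ("steak", 2, 2, 180),
  ("fish", 3, 3, 100)
).toDF("C1", "C2", "C3", "C4")

// Hypothetical stand-in for rows loaded from a database:
// (new column name, Spark SQL expression string)
val expressions: Seq[(String, String)] = Seq(
  "C5" -> "C2 / C3 + C4",
  "C6" -> "C4 * 2"
)

// Add one column per expression; each step sees the columns added before it.
val result: DataFrame = expressions.foldLeft(df) {
  case (acc, (name, sqlExpr)) => acc.withColumn(name, expr(sqlExpr))
}

result.show()
```

No UDF is needed here, because `expr` resolves the column names inside each expression string against the DataFrame. Because the fold applies the expressions in order, a later expression could also refer to a column added by an earlier one.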

answered Sep 28 '22 11:09

rajesh-nitc
