 

Scala add new column to dataframe by expression

I want to add new columns to a DataFrame using expressions. For example, I have a DataFrame of

+-----+----------+----------+-----+
| C1  | C2       |   C3     |C4   |
+-----+----------+----------+-----+
|steak|1         |1         |  150|
|steak|2         |2         |  180|
| fish|3         |3         |  100|
+-----+----------+----------+-----+

and I want to create a new column C5 with the expression "C2/C3+C4". Assume there are several new columns to add, and that the expressions may differ and come from a database.

Is there a good way to do this?

I know that if I have an expression like "2+3*4" I can use scala.tools.reflect.ToolBox to eval it.

Normally I use df.withColumn to add a new column.

It seems I need to create a UDF, but how can I pass the column values as parameters to the UDF? In particular, different expressions may need different columns for their calculation.

Robin Wang asked Sep 07 '17 03:09


People also ask

How do I add a new column in Scala?

Use withColumn() to add a new column. The withColumn() function takes two arguments: the first is the name of the new column, and the second is the value of the column as a Column type.
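For instance, a minimal Scala sketch (assuming a DataFrame df with the question's columns):

```scala
import org.apache.spark.sql.functions.col

// New column "C5" computed from existing columns; the second
// argument to withColumn is a Column expression
val withC5 = df.withColumn("C5", col("C2") / col("C3") + col("C4"))
```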

How do I add a column in spark DataFrame?

In PySpark, to add a new column to a DataFrame, use the lit() function, imported via from pyspark.sql.functions import lit. lit() takes the constant value you want to add and returns a Column type; to add a NULL / None value, use lit(None).

What is withColumn?

The withColumn() function is the most extensively used column transformation in Spark SQL DataFrame operations. It covers column operations such as changing the current value of a column, converting the data type of an existing column, creating a new column, and more.

What is Spark lit()?

The Spark SQL functions lit() and typedLit() are used to add a new column to a DataFrame by assigning a literal or constant value. Both functions return a Column type.
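In Scala (the question's language), a sketch of the same idea looks like this; the column name "source" and the constant are illustrative only:

```scala
import org.apache.spark.sql.functions.lit

// lit() wraps a constant in a Column; lit(null) would add a NULL column
val withConstant = df.withColumn("source", lit("menu"))
```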


2 Answers

This can be done using expr to create a Column from an expression:

val df = Seq((1,2)).toDF("x","y")

val myExpression = "x+y"

import org.apache.spark.sql.functions.expr

df.withColumn("z",expr(myExpression)).show()

+---+---+---+
|  x|  y|  z|
+---+---+---+
|  1|  2|  3|
+---+---+---+
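Since the question mentions several expressions coming from a database, one way to apply them all (a sketch, assuming the expressions arrive as (columnName, expressionString) pairs) is to fold them over the DataFrame:

```scala
import org.apache.spark.sql.functions.expr

// Hypothetical expression list, e.g. loaded from a database table
val newColumns = Seq("z" -> "x+y", "w" -> "x*y")

// Each step adds one column parsed from its expression string
val result = newColumns.foldLeft(df) { case (acc, (name, e)) =>
  acc.withColumn(name, expr(e))
}
```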
Raphael Roth answered Sep 28 '22 11:09


Two approaches:

    import spark.implicits._ //so that you could use .toDF
    val df = Seq(
      ("steak", 1, 1, 150),
      ("steak", 2, 2, 180),
      ("fish", 3, 3, 100)
    ).toDF("C1", "C2", "C3", "C4")

    import org.apache.spark.sql.functions._

    // 1st approach using expr
    df.withColumn("C5", expr("C2/(C3 + C4)")).show()

    // 2nd approach using selectExpr
    df.selectExpr("*", "(C2/(C3 + C4)) as C5").show()

+-----+---+---+---+--------------------+
|   C1| C2| C3| C4|                  C5|
+-----+---+---+---+--------------------+
|steak|  1|  1|150|0.006622516556291391|
|steak|  2|  2|180| 0.01098901098901099|
| fish|  3|  3|100| 0.02912621359223301|
+-----+---+---+---+--------------------+
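On the UDF part of the question: column values are passed to a UDF by listing the columns as positional arguments. For expressions stored as strings, expr is simpler, since a UDF body is fixed Scala code; but as a hedged sketch of the mechanics:

```scala
import org.apache.spark.sql.functions.{udf, col}

// The UDF receives the row's values for C2, C3 and C4 positionally
val calc = udf((c2: Int, c3: Int, c4: Int) => c2.toDouble / c3 + c4)

val withUdf = df.withColumn("C5", calc(col("C2"), col("C3"), col("C4")))
```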
rajesh-nitc answered Sep 28 '22 11:09