 

Spark: Add column to dataframe conditionally

I am trying to take my input data:

A    B       C
--------------
4    blah    2
2            3
56   foo     3

And add a column to the end based on whether B is empty or not:

A    B       C     D
--------------------
4    blah    2     1
2            3     0
56   foo     3     1

I can do this easily by registering the input dataframe as a temp table, then typing up a SQL query.
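For reference, the temp-table route mentioned here might look like the sketch below. The table name `input` and the variable names are illustrative assumptions; the Spark calls are shown as comments so the snippet stands on its own (`registerTempTable` is the Spark 1.x API).

```scala
// Hypothetical sketch of the temp-table + SQL approach (names are assumptions)
val query = """
  SELECT A, B, C,
         CASE WHEN B IS NULL OR B = '' THEN 0 ELSE 1 END AS D
  FROM input
"""
// df.registerTempTable("input")      // Spark 1.x; createOrReplaceTempView in 2.x+
// val withD = sqlContext.sql(query)
```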

But I'd really like to know how to do this with just Scala methods and not having to type out a SQL query within Scala.

I've tried .withColumn, but I can't get that to do what I want.

asked Jan 20 '16 by mcmcmc


1 Answer

Try withColumn with the when function, as follows:

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._ // for `toDF` and $""
import org.apache.spark.sql.functions._ // for `when`

val df = sc.parallelize(Seq((4, "blah", 2), (2, "", 3), (56, "foo", 3), (100, null, 5)))
    .toDF("A", "B", "C")

val newDf = df.withColumn("D", when($"B".isNull or $"B" === "", 0).otherwise(1))

newDf.show() shows

+---+----+---+---+
|  A|   B|  C|  D|
+---+----+---+---+
|  4|blah|  2|  1|
|  2|    |  3|  0|
| 56| foo|  3|  1|
|100|null|  5|  0|
+---+----+---+---+

I added the (100, null, 5) row for testing the isNull case.

I tested this code with Spark 1.6.0, but as noted in the documentation of when, it is available in versions 1.4.0 and later.
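The same null-or-empty check can also be expressed as a plain Scala function and wrapped in a UDF. This is just an alternative sketch, not the answer's method; the function name and the commented Spark lines (which assume the `df` and imports from the answer above) are illustrative assumptions.

```scala
// 0 when the value is null or empty, 1 otherwise; mirrors the `when` expression
def nonEmptyFlag(b: String): Int = if (b == null || b.isEmpty) 0 else 1

// Applied via a UDF (assumes the `df` and imports defined in the answer):
// val flagUdf = udf(nonEmptyFlag _)
// val newDf2 = df.withColumn("D", flagUdf($"B"))
```

Note that the built-in when/otherwise form is usually preferable, since Catalyst can optimize it while a UDF is opaque to the optimizer.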

answered Oct 07 '22 by emeth