I would like to modify the cell values of a DataFrame column (Age) where it is currently blank, but only if another column (Survived) has the value 0 for the corresponding row. If Survived is 1 and Age is blank, I want to keep Age as null.
I tried to use the && operator, but it didn't work. Here is my code:
tdata.withColumn("Age", when((tdata.Age == "" && tdata.Survived == "0"), mean_age_0).otherwise(tdata.Age)).show()
Any suggestions on how to handle this? Thanks.
Error Message:
File "<ipython-input-33-3e691784411c>", line 1
    tdata.withColumn("Age", when((tdata.Age == "" && tdata.Survived == "0"), mean_age_0).otherwise(tdata.Age)).show()
                                                   ^
SyntaxError: invalid syntax
In PySpark, multiple conditions in when can be built using & (for and) and | (for or).
PySpark when() is a SQL function; to use it you first have to import it from pyspark.sql.functions, and it returns a Column type. otherwise() is a function of Column; when otherwise() is not used and none of the conditions are met, a None (null) value is assigned. The usage looks like when(condition, value).otherwise(default).
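As a minimal sketch of that null behavior (the DataFrame and column names here are made up for illustration):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1,), (-1,)], ["value"])

# No otherwise(): the row with value == -1 matches no condition,
# so its "label" column is null
df.withColumn("label", when(col("value") > 0, "positive")).show()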
To subset or filter the data from the DataFrame, we use the filter() function. It filters rows from the DataFrame on the basis of a given condition, which can be single or multiple: df.filter(condition), where df is the DataFrame from which the data is subset or filtered.
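For example, a small sketch of both forms (again with a made-up DataFrame):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1,), (5,), (-3,)], ["value"])

# Single condition
df.filter(col("value") > 0).show()

# Multiple conditions: each comparison parenthesized, combined with & or |
df.filter((col("value") > 0) & (col("value") < 4)).show()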
If you want to select a subset of rows, one method is to create an index column using monotonically_increasing_id(). From the docs: "The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive."
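A minimal sketch of that approach (the column name idx is arbitrary; coalesce(1) is used here only so the generated IDs happen to start at 0 in a single partition):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, monotonically_increasing_id

spark = SparkSession.builder.getOrCreate()

# One partition, so the IDs come out as 0, 1, 2
df = spark.createDataFrame([("a",), ("b",), ("c",)], ["letter"]).coalesce(1)
df = df.withColumn("idx", monotonically_increasing_id())

# Keep the first two rows by filtering on the index column
df.filter(col("idx") < 2).show()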
You get a SyntaxError exception because Python has no && operator. It has and and &, where the latter one is the correct choice to create boolean expressions on Column (| for logical disjunction and ~ for logical negation).
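For illustration, a sketch of the ~ negation operator, which is not demonstrated below (the two-row tdata here is a made-up stand-in for the question's DataFrame):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
tdata = spark.createDataFrame([("", "0"), ("22", "1")], ["Age", "Survived"])

# ~ negates a boolean Column expression: keeps rows where Age is NOT blank
tdata.where(~(col("Age") == "")).show()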
The condition you created is also invalid because it doesn't consider operator precedence. & in Python has a higher precedence than ==, so the expression has to be parenthesized.
(col("Age") == "") & (col("Survived") == "0")
## Column<b'((Age = ) AND (Survived = 0))'>
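Applied to the question, a sketch of the corrected statement (tdata and mean_age_0 are assumed to exist as in the question):

from pyspark.sql.functions import when

# && replaced by &, and each comparison parenthesized
tdata.withColumn("Age",
                 when((tdata.Age == "") & (tdata.Survived == "0"), mean_age_0)
                 .otherwise(tdata.Age)).show()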
On a side note, the when function is equivalent to the SQL case expression, not the WHEN clause. Still, the same rules apply. Conjunction:
df.where((col("foo") > 0) & (col("bar") < 0))
Disjunction:
df.where((col("foo") > 0) | (col("bar") < 0))
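To make the side note above concrete, here is a sketch of the SQL CASE expression that such a when chain corresponds to (the DataFrame, view name, and labels are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, -1), (-1, 2)], ["foo", "bar"])
df.createOrReplaceTempView("df")

# when(...).otherwise(...) expresses the same logic as a CASE expression
spark.sql("""
    SELECT foo, bar,
           CASE WHEN foo > 0 AND bar < 0 THEN 'hit' ELSE 'miss' END AS label
    FROM df
""").show()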
You can of course define conditions separately to avoid brackets:
cond1 = col("Age") == ""
cond2 = col("Survived") == "0"
cond1 & cond2
Note: In PySpark it is important to enclose every expression within parentheses () when they combine to form the condition.
%pyspark
from pyspark.sql.functions import col, when

dataDF = spark.createDataFrame([(66, "a", "4"),
                                (67, "a", "0"),
                                (70, "b", "4"),
                                (71, "d", "4")],
                               ("id", "code", "amt"))

# Each comparison is parenthesized and combined with | or &
dataDF.withColumn("new_column",
                  when((col("code") == "a") | (col("code") == "d"), "A")
                  .when((col("code") == "b") & (col("amt") == "4"), "B")
                  .otherwise("A1")).show()
In Spark Scala code, (&&) or (||) conditions can be used within the when function:
//scala
// imports needed when not running in the Spark shell
import org.apache.spark.sql.functions.{col, when}
import spark.implicits._

val dataDF = Seq(
  (66, "a", "4"), (67, "a", "0"), (70, "b", "4"), (71, "d", "4")
).toDF("id", "code", "amt")

// && and || combine Column conditions directly in Scala
dataDF.withColumn("new_column",
  when(col("code") === "a" || col("code") === "d", "A")
  .when(col("code") === "b" && col("amt") === "4", "B")
  .otherwise("A1")).show()
=======================
Output:
+---+----+---+----------+
| id|code|amt|new_column|
+---+----+---+----------+
| 66| a| 4| A|
| 67| a| 0| A|
| 70| b| 4| B|
| 71| d| 4| A|
+---+----+---+----------+
This code snippet is copied from sparkbyexamples.com