Table:
+---------+---------+---------+---------+
|id |path |error |message |
+---------+---------+---------+---------+
| 1 | a.a.a| true | "aaa" |
+---------+---------+---------+---------+
| 2 | a.a.a| true | "bbb" |
+---------+---------+---------+---------+
| 2 | a.a.a| true | "bbc" |
+---------+---------+---------+---------+
| 2 | a.a.b| false | "ccc" |
+---------+---------+---------+---------+
I have pySpark query:
data.groupBy('id', 'path')\
.agg(
sum(when(col('error') == 'true', 1).otherwise(0)).alias('count'),
).show()
How can I add a new column with first example element that have col('error') == 'true'? I want a table with elements id, path, count, exampleItem.
Function first().alias('exampleItem') works, but returns elements that do not necessarily match the condition above.
What you can do is define a column that only contains the message in the case where the error is true, and null otherwise using the when function. Then, the first function used with ignorenulls set to true will give you what you expect.
d = [(1, "a.a.a", True, "aaa"), (2, "a.a.a", True, "bbb"),
(2, "a.a.a", True, "bbc"), (2, "a.a.b", False, "ccc")]
data = spark.createDataFrame(d, ['id', 'path', 'error', 'message'])
data\
.groupBy('id', 'path')\
.agg(F.sum(F.when(F.col('error') == 'true', 1).otherwise(0)).alias('count'),
F.first(F.when(F.col('error'), F.col('message')), ignorenulls=True).alias('exampleItem'))\
.show()
+---+-----+-----+-----------+
| id| path|count|exampleItem|
+---+-----+-----+-----------+
| 2|a.a.a| 2| bbb|
| 1|a.a.a| 1| aaa|
| 2|a.a.b| 0| null|
+---+-----+-----+-----------+
For the last row, the null value is due to the fact that no message meet the desired requirement.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With