I am trying to get some counts on a DataFrame using agg and count.
from pyspark.sql import Row, functions as F

row = Row("Cat", "Date")
df = sc.parallelize([
    row("A", "2017-03-03"),
    row("A", None),
    row("B", "2017-03-04"),
    row("B", "Garbage"),
    row("A", "2016-03-04"),
]).toDF()

df = df.withColumn("Casted", df["Date"].cast("date"))
df.show()
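Assuming the default permissive cast behavior (ANSI mode off), both the None and the unparseable 'Garbage' come out as null in Casted, so the show() output should look something like:

+---+----------+----------+
|Cat|      Date|    Casted|
+---+----------+----------+
|  A|2017-03-03|2017-03-03|
|  A|      null|      null|
|  B|2017-03-04|2017-03-04|
|  B|   Garbage|      null|
|  A|2016-03-04|2016-03-04|
+---+----------+----------+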
(
    df.groupby(df["Cat"])
    .agg(
        # F.count(F.col("Date").isNull() | F.col("Date").isNotNull()).alias("Date_Count"),
        F.count("Date").alias("Date_Count"),
        F.count("Casted").alias("Valid_Date_Count"),
    )
    .show()
)
The function F.count() gives me only the non-null count. Is there a way to get the count including nulls, other than using an 'OR' condition?
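For reference, F.count over an expression that is never null, such as '*' or a literal, does include the null rows. A minimal sketch against the df above, with hypothetical aliases:

df.groupby(df["Cat"]).agg(
    F.count("*").alias("All_Rows"),         # counts every row, nulls included
    F.count(F.lit(1)).alias("All_Rows_2"),  # equivalent: a literal is never null
).show()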
The invalid count doesn't seem to work, either; the & condition is not behaving as I expected:
(
    df.groupby(df["Cat"])
    .agg(
        F.count("*").alias("count"),
        F.count("Date").alias("Date_Count"),
        F.count("Casted").alias("Valid_Date_Count"),
        # meant to count invalid dates, but returns the full row count
        F.count(F.col("Date").isNotNull() & F.col("Casted").isNull()).alias("invalid"),
    )
    .show()
)
F.count counts non-null values, and the boolean expression isNotNull() & isNull() is never null (it always evaluates to true or false), so counting it just returns the total row count. Cast the boolean expression to an int and sum it instead:
df.groupby(df["Cat"]).agg(
    F.count("Date").alias("Date_Count"),
    F.count("Casted").alias("Valid_Date_Count"),
    # true -> 1, false -> 0, then sum the ones
    F.sum((~F.isnull("Date") & F.isnull("Casted")).cast("int")).alias("Invalid_Date_Count"),
).show()
+---+----------+----------------+------------------+
|Cat|Date_Count|Valid_Date_Count|Invalid_Date_Count|
+---+----------+----------------+------------------+
|  B|         2|               1|                 1|
|  A|         2|               2|                 0|
+---+----------+----------------+------------------+
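An equivalent approach, if you'd rather stick with count: wrap the condition in F.when with no otherwise clause, so non-matching rows become null and count skips them:

df.groupby(df["Cat"]).agg(
    F.count(
        F.when(F.col("Date").isNotNull() & F.col("Casted").isNull(), 1)
    ).alias("Invalid_Date_Count")
).show()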