
Pyspark Count includes Nulls

Tags:

pyspark

Running a simple example -

from pyspark.sql import functions as F

dept = [("Finance",10),("Marketing",None),("Sales",30),("IT",40)]
deptColumns = ["dept_name","dept_id"]
rdd = sc.parallelize(dept)  # sc is the SparkContext provided by Databricks
df = rdd.toDF(deptColumns)
df.show(truncate=False)

print('count the dept_id, should be 3')
print('count: ' + str(df.select(F.col("dept_id")).count()))

We get the following output -

+---------+-------+
|dept_name|dept_id|
+---------+-------+
|Finance  |10     |
|Marketing|null   |
|Sales    |30     |
|IT       |40     |
+---------+-------+

count the dept_id, should be 3
count: 4

I'm running on Databricks with this stack: Spark 3.0.1, Scala 2.12, DBR 7.3 LTS.

Thanks for any help!!

Shahar Katz asked Apr 02 '26 11:04

2 Answers

There is a subtle difference between the count function of the DataFrame API and the count function of Spark SQL. The first simply counts all rows, while the second counts only rows where the supplied expression is non-null.

You are using DataFrame.count(). According to the documentation, this function

returns the number of rows in this DataFrame

So the result 4 is correct as there are 4 rows in the dataframe.

If null values should be ignored, you can use the Spark SQL function count which can ignore null values:

count(expr[, expr...]) - Returns the number of rows for which the supplied expression(s) are all non-null.

For example

df.selectExpr("count(dept_id)").show()

returns 3.

werner answered Apr 03 '26 23:04


An alternative to @werner's solution is to use pyspark.sql.functions.count, which also ignores null values:

from pyspark.sql import functions as F

print('count: ' + str(df.select(F.count(F.col("dept_id"))).collect()))
Shahar Katz answered Apr 04 '26 01:04

