Running a simple example -
from pyspark.sql import functions as F  # needed for F.col below

dept = [("Finance",10),("Marketing",None),("Sales",30),("IT",40)]
deptColumns = ["dept_name","dept_id"]
rdd = sc.parallelize(dept)
df = rdd.toDF(deptColumns)
df.show(truncate=False)
print('count the dept_id, should be 3')
print('count: ' + str(df.select(F.col("dept_id")).count()))
We get the following output -
+---------+-------+
|dept_name|dept_id|
+---------+-------+
|Finance |10 |
|Marketing|null |
|Sales |30 |
|IT |40 |
+---------+-------+
count the dept_id, should be 3
count: 4
I'm running on Databricks and this is my stack: Spark 3.0.1, Scala 2.12, DBR 7.3 LTS.
Thanks for any help!!
There is a subtle difference between the count function of the DataFrame API and the count function of Spark SQL. The first one simply counts all rows, while the second one can ignore null values.
You are using DataFrame.count(). According to the documentation, this function
returns the number of rows in this DataFrame
So the result 4 is correct as there are 4 rows in the dataframe.
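This row-counting behaviour can be sketched in plain Python (a simplified model for illustration, not Spark code): DataFrame.count() is just the number of rows, nulls included.

```python
# Simplified model of DataFrame.count(): it counts rows,
# regardless of whether any column holds a null value.
rows = [("Finance", 10), ("Marketing", None), ("Sales", 30), ("IT", 40)]

def dataframe_count(rows):
    # DataFrame.count() == total number of rows
    return len(rows)

print(dataframe_count(rows))  # 4
```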
If null values should be ignored, you can instead use the Spark SQL aggregate function count:
count(expr[, expr...]) - Returns the number of rows for which the supplied expression(s) are all non-null.
For example
df.selectExpr("count(dept_id)").show()
returns 3.
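For contrast, the null-skipping behaviour of the SQL aggregate count(expr) can be modelled the same way (again a plain-Python sketch, not Spark code): only rows where the expression is non-null contribute to the result.

```python
# Simplified model of the SQL aggregate count(expr):
# only non-null values are counted.
rows = [("Finance", 10), ("Marketing", None), ("Sales", 30), ("IT", 40)]

def sql_count(values):
    # count(dept_id) == number of non-null dept_id values
    return sum(1 for v in values if v is not None)

print(sql_count(dept_id for _, dept_id in rows))  # 3
```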
An alternative to @werner's solution is to use pyspark.sql.functions:
from pyspark.sql import functions as F
# collect() returns a list of Rows; [0][0] extracts the scalar count (3)
print('count: ' + str(df.select(F.count(F.col("dept_id"))).collect()[0][0]))