Running a simple example -
from pyspark.sql import functions as F  # needed for F.col below

dept = [("Finance",10),("Marketing",None),("Sales",30),("IT",40)]
deptColumns = ["dept_name","dept_id"]
rdd = sc.parallelize(dept)
df = rdd.toDF(deptColumns)
df.show(truncate=False)
print('count the dept_id, should be 3')
print('count: ' + str(df.select(F.col("dept_id")).count()))
We get the following output -
+---------+-------+
|dept_name|dept_id|
+---------+-------+
|Finance |10 |
|Marketing|null |
|Sales |30 |
|IT |40 |
+---------+-------+
count the dept_id, should be 3
count: 4
I'm running on Databricks and this is my stack: Spark 3.0.1, Scala 2.12, DBR 7.3 LTS.
Thanks for any help!!
There is a subtle difference between the count function of the DataFrame API and the count function of Spark SQL. The first one simply counts all rows, while the second one can ignore null values.
You are using DataFrame.count(). According to the documentation, this function
returns the number of rows in this DataFrame
So the result 4 is correct as there are 4 rows in the dataframe.
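This row-counting behaviour can be sketched in plain Python (a simplified model for illustration, not Spark code): DataFrame.count() is just the number of rows, nulls included.

```python
# Simplified model of DataFrame.count(): it counts rows,
# regardless of whether any column holds a null value.
rows = [("Finance", 10), ("Marketing", None), ("Sales", 30), ("IT", 40)]

def dataframe_count(rows):
    # DataFrame.count() == total number of rows
    return len(rows)

print(dataframe_count(rows))  # 4
```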
If null values should be ignored, you can instead use the Spark SQL aggregate function count:
count(expr[, expr...]) - Returns the number of rows for which the supplied expression(s) are all non-null.
For example
df.selectExpr("count(dept_id)").show()
returns 3.
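For contrast, the null-skipping behaviour of the SQL aggregate count(expr) can be modelled the same way (again a plain-Python sketch, not Spark code): only rows where the expression is non-null contribute to the result.

```python
# Simplified model of the SQL aggregate count(expr):
# only non-null values are counted.
rows = [("Finance", 10), ("Marketing", None), ("Sales", 30), ("IT", 40)]

def sql_count(values):
    # count(dept_id) == number of non-null dept_id values
    return sum(1 for v in values if v is not None)

print(sql_count(dept_id for _, dept_id in rows))  # 3
```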
An alternative to @werner's solution is to use pyspark.sql.functions:
from pyspark.sql import functions as F
# collect() returns a list of Rows; [0][0] extracts the scalar count (3)
print('count: ' + str(df.select(F.count(F.col("dept_id"))).collect()[0][0]))