I have a DataFrame:

    test = spark.createDataFrame(
        [('bn', 12452, 221), ('mb', 14521, 330), ('bn', 2, 220), ('mb', 14520, 331)],
        ['x', 'y', 'z']
    )
    test.show()
I need to count the rows based on a condition:
    test.groupBy("x").agg(count(col("y") > 12453), count(col("z") > 230)).show()
which gives
    +---+------------------+----------------+
    |  x|count((y > 12453))|count((z > 230))|
    +---+------------------+----------------+
    | bn|                 2|               2|
    | mb|                 2|               2|
    +---+------------------+----------------+
That's just the total row count per group, not the count of rows that satisfy each condition.
count doesn't sum Trues, it only counts the number of non-null values. A condition like col('y') > 12453 evaluates to True or False (never null when y itself is non-null), so count sees a non-null value on every row and simply returns the row count per group. To count the True values, you need to convert the condition to 1/0 and then sum:
    import pyspark.sql.functions as F

    # 1 when the condition holds, 0 otherwise, then sum per group
    cnt_cond = lambda cond: F.sum(F.when(cond, 1).otherwise(0))

    test.groupBy('x').agg(
        cnt_cond(F.col('y') > 12453).alias('y_cnt'),
        cnt_cond(F.col('z') > 230).alias('z_cnt')
    ).show()

    +---+-----+-----+
    |  x|y_cnt|z_cnt|
    +---+-----+-----+
    | bn|    0|    0|
    | mb|    2|    2|
    +---+-----+-----+
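If you'd rather not spell out when/otherwise, a shorter equivalent is to cast the boolean condition to an integer and sum it. This is just a sketch against the same test DataFrame, not a different technique:

    import pyspark.sql.functions as F

    # Equivalent: True casts to 1 and False to 0, so summing the cast
    # condition counts the rows where it holds (sum ignores nulls)
    test.groupBy('x').agg(
        F.sum((F.col('y') > 12453).cast('int')).alias('y_cnt'),
        F.sum((F.col('z') > 230).cast('int')).alias('z_cnt')
    ).show()

Both versions produce the same counts; the cast form just leans on Spark's boolean-to-integer conversion instead of an explicit when.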