pyspark count rows on condition

I have a DataFrame:

test = spark.createDataFrame([('bn', 12452, 221), ('mb', 14521, 330), ('bn', 2, 220), ('mb', 14520, 331)], ['x', 'y', 'z'])
test.show()

I need to count the rows based on a condition:

from pyspark.sql.functions import col, count

test.groupBy("x").agg(count(col("y") > 12453), count(col("z") > 230)).show()

which gives

+---+------------------+----------------+
|  x|count((y > 12453))|count((z > 230))|
+---+------------------+----------------+
| bn|                 2|               2|
| mb|                 2|               2|
+---+------------------+----------------+

That is just the total number of rows in each group, not the number of rows that satisfy each condition.

newleaf asked Feb 28 '18 04:02


1 Answer

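count doesn't sum the True values; it only counts the number of non-null values. A comparison such as y > 12453 evaluates to True or False for every row here, and both are non-null, so every row is counted. To count only the rows where the condition is True, convert the condition to 1 / 0 and then sum: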

import pyspark.sql.functions as F

cnt_cond = lambda cond: F.sum(F.when(cond, 1).otherwise(0))
test.groupBy('x').agg(
    cnt_cond(F.col('y') > 12453).alias('y_cnt'),
    cnt_cond(F.col('z') > 230).alias('z_cnt')
).show()

+---+-----+-----+
|  x|y_cnt|z_cnt|
+---+-----+-----+
| bn|    0|    0|
| mb|    2|    2|
+---+-----+-----+
Psidom answered Sep 18 '22 13:09