I have a DataFrame:

    test = spark.createDataFrame(
        [('bn', 12452, 221), ('mb', 14521, 330), ('bn', 2, 220), ('mb', 14520, 331)],
        ['x', 'y', 'z']
    )
    test.show()
I need to count the rows based on a condition:
    test.groupBy("x").agg(count(col("y") > 12453), count(col("z") > 230)).show()
which gives
    +---+------------------+----------------+
    |  x|count((y > 12453))|count((z > 230))|
    +---+------------------+----------------+
    | bn|                 2|               2|
    | mb|                 2|               2|
    +---+------------------+----------------+
That's just the total row count per group, not the count of rows that satisfy each condition.
count doesn't sum Trues, it only counts the number of non-null values. A condition like col('y') > 12453 evaluates to True or False (never null when y itself is non-null), so count sees a non-null value on every row and simply returns the row count per group. To count the True values, you need to convert the condition to 1/0 and then sum:
    import pyspark.sql.functions as F

    # 1 when the condition holds, 0 otherwise, then sum per group
    cnt_cond = lambda cond: F.sum(F.when(cond, 1).otherwise(0))

    test.groupBy('x').agg(
        cnt_cond(F.col('y') > 12453).alias('y_cnt'),
        cnt_cond(F.col('z') > 230).alias('z_cnt')
    ).show()

    +---+-----+-----+
    |  x|y_cnt|z_cnt|
    +---+-----+-----+
    | bn|    0|    0|
    | mb|    2|    2|
    +---+-----+-----+
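If you'd rather not spell out when/otherwise, a shorter equivalent is to cast the boolean condition to an integer and sum it. This is just a sketch against the same test DataFrame, not a different technique:

    import pyspark.sql.functions as F

    # Equivalent: True casts to 1 and False to 0, so summing the cast
    # condition counts the rows where it holds (sum ignores nulls)
    test.groupBy('x').agg(
        F.sum((F.col('y') > 12453).cast('int')).alias('y_cnt'),
        F.sum((F.col('z') > 230).cast('int')).alias('z_cnt')
    ).show()

Both versions produce the same counts; the cast form just leans on Spark's boolean-to-integer conversion instead of an explicit when.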