I have a PySpark DataFrame with an A field, a few B fields that depend on A (A -> B), and C fields that I want to aggregate per A. For example:
A | B | C
----------
A | 1 | 6
A | 1 | 7
B | 2 | 8
B | 2 | 4
I wish to group by A, keep any value of B, and run an aggregation (let's say SUM) on C.
The expected result would be:
A | B | C
----------
A | 1 | 13
B | 2 | 12
SQL-wise I would do:
SELECT A, COALESCE(B) as B, SUM(C) as C
FROM T
GROUP BY A
What is the PySpark way to do that?
I can group by A and B together, or select MIN(B) per A, for example:
df.groupBy('A').agg(F.min('B').alias('B'),F.sum('C').alias('C'))
or
df.groupBy(['A','B']).agg(F.sum('C').alias('C'))
but that seems inefficient. Is there anything similar to SQL's COALESCE in PySpark?
Thanks
You'll just need to use first instead:
from pyspark.sql import Row, SparkSession
from pyspark.sql.functions import first, sum, col

spark = SparkSession.builder.getOrCreate()

array = [Row(A="A", B=1, C=6),
         Row(A="A", B=1, C=7),
         Row(A="B", B=2, C=8),
         Row(A="B", B=2, C=4)]
df = spark.createDataFrame(array)

# first(B) picks one of the (identical) B values per group; sum(C) aggregates C
results = df.groupBy(col("A")).agg(first(col("B")).alias("B"),
                                   sum(col("C")).alias("C"))
Let's now check the results:
results.show()
# +---+---+---+
# | A| B| C|
# +---+---+---+
# | B| 2| 12|
# | A| 1| 13|
# +---+---+---+
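If you prefer to stay close to the SQL phrasing from the question, the same aggregation can also be expressed with Spark SQL, using first where the question sketched COALESCE(B). This is a minimal sketch, assuming the df and spark defined above; the view name T is just an illustrative choice:

# Register the DataFrame as a temporary view so it can be queried with SQL
df.createOrReplaceTempView("T")

# first(B) picks one of the (identical) B values per group,
# exactly like the DataFrame version above
spark.sql("""
    SELECT A, first(B) AS B, SUM(C) AS C
    FROM T
    GROUP BY A
""").show()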
From the comments:

Is first here computationally equivalent to any?

groupBy causes a shuffle, so non-deterministic behaviour is to be expected.
Which is confirmed in the documentation of first:

Aggregate function: returns the first value in a group. The function by default returns the first values it sees. It will return the first non-null value it sees when ignoreNulls is set to true. If all values are null, then null is returned. Note: The function is non-deterministic because its results depend on the order of rows, which may be non-deterministic after a shuffle.
So yes, computationally they are the same, and that's one of the reasons you need to use sorting if you need deterministic behaviour.
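As a hedged illustration of that last point (not part of the original answer): one way to make the picked B value independent of shuffle order is to collect the B values per group, sort them inside the aggregation, and take the first element:

from pyspark.sql.functions import collect_list, sort_array, sum, col

# sort_array(collect_list(B))[0] always returns the smallest B in each group,
# so the result no longer depends on row order after the shuffle
deterministic = df.groupBy(col("A")).agg(
    sort_array(collect_list(col("B")))[0].alias("B"),
    sum(col("C")).alias("C"))
deterministic.show()

In this sample data B is constant within each group, so this returns the same values as first, just with a guaranteed choice.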
I hope this helps!