collect_list by preserving order based on another variable

I am trying to create a new column of lists in Pyspark using a groupby aggregation on an existing set of columns. An example input data frame is provided below:

------------------------
id | date       | value
------------------------
1  | 2014-01-03 | 10
1  | 2014-01-04 | 5
1  | 2014-01-05 | 15
1  | 2014-01-06 | 20
2  | 2014-02-10 | 100
2  | 2014-03-11 | 500
2  | 2014-04-15 | 1500

The expected output is:

id | value_list
------------------------
1  | [10, 5, 15, 20]
2  | [100, 500, 1500]

The values within a list are sorted by the date.
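For reference, a minimal sketch that builds this example data frame (it assumes an existing SparkSession named spark; dates are kept as strings, which still sort correctly in yyyy-MM-dd form):

input_df = spark.createDataFrame(
    [(1, '2014-01-03', 10), (1, '2014-01-04', 5),
     (1, '2014-01-05', 15), (1, '2014-01-06', 20),
     (2, '2014-02-10', 100), (2, '2014-03-11', 500),
     (2, '2014-04-15', 1500)],
    ['id', 'date', 'value'])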

I tried using collect_list as follows:

from pyspark.sql import functions as F

ordered_df = input_df.orderBy(['id', 'date'], ascending=True)
grouped_df = ordered_df.groupby("id").agg(F.collect_list("value"))

But collect_list doesn't guarantee order even if I sort the input data frame by date before aggregation.

Could someone help me do the aggregation while preserving the order based on a second (date) variable?

asked Oct 05 '17 by Ravi

People also ask

Does collect_list preserve order?

Does this mean collect_list also maintains the order? In the code above, the entire dataset is sorted before collect_list(), so in that particular case, yes.

How do you use PySpark collect?

PySpark collect() retrieves data from a DataFrame. collect() is an operation on an RDD or DataFrame that returns all of the data to the driver. It is useful for retrieving all the elements of every row from each partition and bringing them over to the driver node/program.
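A minimal sketch of collect() on toy data (again assuming an existing SparkSession named spark):

df = spark.createDataFrame([(1, 'a'), (2, 'b')], ['id', 'letter'])
rows = df.collect()         # a list of Row objects, pulled to the driver
print(rows[0]['letter'])    # prints 'a'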


1 Answer

from pyspark.sql import functions as F
from pyspark.sql import Window

w = Window.partitionBy('id').orderBy('date')

sorted_list_df = input_df.withColumn(
        'sorted_list', F.collect_list('value').over(w)
    )\
    .groupBy('id')\
    .agg(F.max('sorted_list').alias('sorted_list'))
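Applied to the example input above, this produces the expected ordered lists (the output shown in the comments is illustrative):

sorted_list_df.show(truncate=False)
# +---+-----------------+
# |id |sorted_list      |
# +---+-----------------+
# |1  |[10, 5, 15, 20]  |
# |2  |[100, 500, 1500] |
# +---+-----------------+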

Window examples provided by users often don't really explain what is going on, so let me dissect this one for you.

As you know, using collect_list together with groupBy will result in an unordered list of values. This is because depending on how your data is partitioned, Spark will append values to your list as soon as it finds a row in the group. The order then depends on how Spark plans your aggregation over the executors.

A Window function allows you to control that situation, grouping rows by a certain value so you can perform an operation over each of the resultant groups:

w = Window.partitionBy('id').orderBy('date') 
  • partitionBy - you want groups/partitions of rows with the same id
  • orderBy - you want each row in the group to be sorted by date

Once you have defined the scope of your Window ("rows with the same id, sorted by date"), you can use it to perform an operation over it, in this case a collect_list:

F.collect_list('value').over(w) 
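A side note: when a window has an orderBy, Spark gives it a running frame by default, from the start of the partition up to the current row. As a sketch, the same window with that default frame spelled out (w_explicit is just an illustrative name):

w_explicit = Window.partitionBy('id').orderBy('date') \
    .rangeBetween(Window.unboundedPreceding, Window.currentRow)

This running frame is why each row ends up with its own, growing copy of the list, as described next.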

At this point you have created a new column sorted_list with an ordered list of values, sorted by date, but you still have duplicated rows per id. To trim away the duplicated rows you groupBy id and keep the max value for each group:

.groupBy('id')\
.agg(F.max('sorted_list').alias('sorted_list'))
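To see why taking the max works, here is a sketch of the intermediate result just before that final groupBy (intermediate is an illustrative name; the output in the comments assumes the example data from the question). Each row carries the running list for its partition, and since every shorter list is a prefix of the longer ones, it compares as smaller, so F.max keeps the complete list:

intermediate = input_df.withColumn('sorted_list', F.collect_list('value').over(w))
intermediate.show(truncate=False)
# +---+----------+-----+-----------------+
# |id |date      |value|sorted_list      |
# +---+----------+-----+-----------------+
# |1  |2014-01-03|10   |[10]             |
# |1  |2014-01-04|5    |[10, 5]          |
# |1  |2014-01-05|15   |[10, 5, 15]      |
# |1  |2014-01-06|20   |[10, 5, 15, 20]  |
# |2  |2014-02-10|100  |[100]            |
# |2  |2014-03-11|500  |[100, 500]       |
# |2  |2014-04-15|1500 |[100, 500, 1500] |
# +---+----------+-----+-----------------+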
answered Oct 01 '22 by TMichel