So I have a Spark DataFrame that looks like this:
a | b | c
5 | 2 | 1
5 | 4 | 3
2 | 4 | 2
2 | 3 | 7
And I want to group by column a, create a list of the values from column b, and forget about c. The output DataFrame would be:
a | b_list
5 | (2,4)
2 | (4,3)
How would I go about doing this with a PySpark SQL DataFrame?
Thank you! :)
A note on the RDD API first: groupByKey groups the values for each key into a single sequence and hash-partitions the resulting RDD with numPartitions partitions. If you are grouping only to perform an aggregation (such as a sum or average) over each key, reduceByKey or aggregateByKey will give much better performance.
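As a quick sketch of that advice, here is a hypothetical RDD built from the same (a, b) pairs; collect() output order may vary across runs:
>>> rdd = spark.sparkContext.parallelize([(5, 2), (5, 4), (2, 4), (2, 3)])
>>> rdd.groupByKey().mapValues(list).collect()  # gather all values per key
[(5, [2, 4]), (2, [4, 3])]
>>> rdd.reduceByKey(lambda x, y: x + y).collect()  # cheaper when you only need a sum
[(5, 6), (2, 7)]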
For the DataFrame route, first create a list of sample data and a list of column names, then pass the zipped data to the spark.createDataFrame() method, which builds the DataFrame.
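For illustration, assuming an active spark session, that zipped-lists route would look something like this (the walkthrough below builds the same DataFrame from a list of dicts instead):
>>> cols = ['a', 'b', 'c']
>>> data = list(zip([5, 5, 2, 2], [2, 4, 4, 3], [1, 3, 2, 7]))
>>> spark.createDataFrame(data, cols)
DataFrame[a: bigint, b: bigint, c: bigint]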
The collect_list() aggregate function gathers the values of a DataFrame column into a list, keeping duplicates (unlike collect_set). It needs to be imported from pyspark.sql.functions.
Here are the steps to get that DataFrame:
>>> from pyspark.sql import functions as F
>>>
>>> d = [{'a': 5, 'b': 2, 'c':1}, {'a': 5, 'b': 4, 'c':3}, {'a': 2, 'b': 4, 'c':2}, {'a': 2, 'b': 3,'c':7}]
>>> df = spark.createDataFrame(d)
>>> df.show()
+---+---+---+
| a| b| c|
+---+---+---+
| 5| 2| 1|
| 5| 4| 3|
| 2| 4| 2|
| 2| 3| 7|
+---+---+---+
>>> df1 = df.groupBy('a').agg(F.collect_list("b"))
>>> df1.show()
+---+---------------+
| a|collect_list(b)|
+---+---------------+
| 5| [2, 4]|
| 2| [4, 3]|
+---+---------------+
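To get the b_list column name asked for in the question, alias the aggregate expression:
>>> df.groupBy('a').agg(F.collect_list('b').alias('b_list')).show()
+---+------+
|  a|b_list|
+---+------+
|  5|[2, 4]|
|  2|[4, 3]|
+---+------+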