GroupByKey and create lists of values pyspark sql dataframe

So I have a spark dataframe that looks like:

a | b | c
5 | 2 | 1
5 | 4 | 3
2 | 4 | 2
2 | 3 | 7

And I want to group by column a, create a list of values from column b, and forget about c. The output dataframe would be:

a | b_list
5 | (2,4)
2 | (4,3)

How would I go about doing this with a pyspark sql dataframe?

Thank you! :)

asked Dec 03 '16 by user2253546


1 Answer

Here are the steps to get that DataFrame:

>>> from pyspark.sql import functions as F
>>>
>>> d = [{'a': 5, 'b': 2, 'c':1}, {'a': 5, 'b': 4, 'c':3}, {'a': 2, 'b': 4, 'c':2}, {'a': 2, 'b': 3,'c':7}]
>>> df = spark.createDataFrame(d)
>>> df.show()
+---+---+---+
|  a|  b|  c|
+---+---+---+
|  5|  2|  1|
|  5|  4|  3|
|  2|  4|  2|
|  2|  3|  7|
+---+---+---+

>>> df1 = df.groupBy('a').agg(F.collect_list("b"))
>>> df1.show()
+---+---------------+
|  a|collect_list(b)|
+---+---------------+
|  5|         [2, 4]|
|  2|         [4, 3]|
+---+---------------+
answered Nov 01 '22 by abaghel