So I have a Spark DataFrame that looks like this:
a | b | c
5 | 2 | 1
5 | 4 | 3
2 | 4 | 2
2 | 3 | 7
And I want to group by column a, create a list of the values from column b, and forget about c. The output DataFrame would be:
a | b_list
5 | (2,4)
2 | (4,3)
How would I go about doing this with a PySpark SQL DataFrame?
Thank you! :)
A note on the RDD API first: groupByKey groups the values for each key into a single sequence and hash-partitions the resulting RDD with numPartitions partitions. If you are grouping only to perform an aggregation (such as a sum or average) over each key, reduceByKey or aggregateByKey will give much better performance.
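As a quick sketch of that advice, here is a hypothetical RDD built from the same (a, b) pairs; collect() output order may vary across runs:
>>> rdd = spark.sparkContext.parallelize([(5, 2), (5, 4), (2, 4), (2, 3)])
>>> rdd.groupByKey().mapValues(list).collect()  # gather all values per key
[(5, [2, 4]), (2, [4, 3])]
>>> rdd.reduceByKey(lambda x, y: x + y).collect()  # cheaper when you only need a sum
[(5, 6), (2, 7)]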
For the DataFrame route, first create a list of sample data and a list of column names, then pass the zipped data to the spark.createDataFrame() method, which builds the DataFrame.
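For illustration, assuming an active spark session, that zipped-lists route would look something like this (the walkthrough below builds the same DataFrame from a list of dicts instead):
>>> cols = ['a', 'b', 'c']
>>> data = list(zip([5, 5, 2, 2], [2, 4, 4, 3], [1, 3, 2, 7]))
>>> spark.createDataFrame(data, cols)
DataFrame[a: bigint, b: bigint, c: bigint]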
The collect_list() aggregate function gathers the values of a DataFrame column into a list, keeping duplicates (unlike collect_set). It needs to be imported from pyspark.sql.functions.
Here are the steps to get that DataFrame:
>>> from pyspark.sql import functions as F
>>>
>>> d = [{'a': 5, 'b': 2, 'c':1}, {'a': 5, 'b': 4, 'c':3}, {'a': 2, 'b': 4, 'c':2}, {'a': 2, 'b': 3,'c':7}]
>>> df = spark.createDataFrame(d)
>>> df.show()
+---+---+---+
| a| b| c|
+---+---+---+
| 5| 2| 1|
| 5| 4| 3|
| 2| 4| 2|
| 2| 3| 7|
+---+---+---+
>>> df1 = df.groupBy('a').agg(F.collect_list("b"))
>>> df1.show()
+---+---------------+
| a|collect_list(b)|
+---+---------------+
| 5| [2, 4]|
| 2| [4, 3]|
+---+---------------+
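To get the b_list column name asked for in the question, alias the aggregate expression:
>>> df.groupBy('a').agg(F.collect_list('b').alias('b_list')).show()
+---+------+
|  a|b_list|
+---+------+
|  5|[2, 4]|
|  2|[4, 3]|
+---+------+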