pyspark collect_set or collect_list with groupby

How can I use collect_set or collect_list on a DataFrame after a groupby? For example, df.groupby('key').collect_set('values') raises an error: AttributeError: 'GroupedData' object has no attribute 'collect_set'

Hanan Shteingart asked Jun 02 '16 00:06


People also ask

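Does collect_list preserve order?

If the entire dataset is sorted before collect_list() is called, the collected values usually follow that order. Spark does not guarantee it, though: collect_list is non-deterministic because the order of the collected results depends on the order of the rows, which may change after a shuffle.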

How do you use groupBy and count in PySpark?

Calling groupBy() on a PySpark DataFrame returns a GroupedData object, which provides the following aggregate functions. count(): returns the number of rows for each group. mean(): returns the mean of the values for each group. max(): returns the maximum of the values for each group.

What is collect_list?

The Spark SQL collect_list() and collect_set() functions create an array (ArrayType) column on a DataFrame by merging rows, typically after a group by or over window partitions. collect_list() keeps duplicate values, while collect_set() removes them.


1 Answer

You need to use agg. Example:

    from pyspark import SparkContext
    from pyspark.sql import HiveContext
    from pyspark.sql import functions as F

    sc = SparkContext("local")

    sqlContext = HiveContext(sc)

    df = sqlContext.createDataFrame([
        ("a", None, None),
        ("a", "code1", None),
        ("a", "code2", "name2"),
    ], ["id", "code", "name"])

    df.show()

    +---+-----+-----+
    | id| code| name|
    +---+-----+-----+
    |  a| null| null|
    |  a|code1| null|
    |  a|code2|name2|
    +---+-----+-----+

Note that the example above creates a HiveContext, which Spark 1.x needs for collect_set and collect_list. See https://stackoverflow.com/a/35529093/690430 for dealing with different Spark versions.

    (df
      .groupby("id")
      .agg(F.collect_set("code"),
           F.collect_list("name"))
      .show())

    +---+-----------------+------------------+
    | id|collect_set(code)|collect_list(name)|
    +---+-----------------+------------------+
    |  a|   [code1, code2]|           [name2]|
    +---+-----------------+------------------+
Kamil Sindi answered Sep 16 '22 21:09