pyspark collect_set or collect_list with groupby

How can I use collect_set or collect_list on a dataframe after groupby? For example: df.groupby('key').collect_set('values'). I get an error: AttributeError: 'GroupedData' object has no attribute 'collect_set'

asked Jun 02 '16 00:06

Hanan Shteingart

1 Answer

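collect_set and collect_list are functions in pyspark.sql.functions, not methods of GroupedData, so you need to pass them to agg. Example: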

    from pyspark import SparkContext
    from pyspark.sql import HiveContext
    from pyspark.sql import functions as F

    sc = SparkContext("local")

    sqlContext = HiveContext(sc)

    df = sqlContext.createDataFrame([
        ("a", None, None),
        ("a", "code1", None),
        ("a", "code2", "name2"),
    ], ["id", "code", "name"])

    df.show()

    +---+-----+-----+
    | id| code| name|
    +---+-----+-----+
    |  a| null| null|
    |  a|code1| null|
    |  a|code2|name2|
    +---+-----+-----+

Note that the example above creates a HiveContext, which collect_set and collect_list require on Spark 1.x. See https://stackoverflow.com/a/35529093/690430 for dealing with different Spark versions.

    (df
      .groupby("id")
      .agg(F.collect_set("code"),
           F.collect_list("name"))
      .show())

    +---+-----------------+------------------+
    | id|collect_set(code)|collect_list(name)|
    +---+-----------------+------------------+
    |  a|   [code1, code2]|           [name2]|
    +---+-----------------+------------------+

answered Sep 16 '22 21:09

Kamil Sindi

Tags: list, group-by, set, collect, pyspark
