
SQL on Spark: How do I get all values of DISTINCT?

So, assume I have the following table:

Name | Color
------------------------------
John | Blue
Greg | Red
John | Yellow
Greg | Red
Greg | Blue

I would like to get, for each name, the number of distinct colors and their values. Meaning, something like this:

Name | Distinct | Values
--------------------------------------
John |   2      | Blue, Yellow
Greg |   2      | Red, Blue

Any ideas how to do so?

asked Mar 20 '16 by shakedzy

People also ask

How do I select distinct values in spark?

Distinct values of a column in PySpark are obtained by using the select() function together with the distinct() function. select() takes one or more column names as arguments, and the following distinct() call returns the distinct values of those columns combined.
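As a plain-Python sketch (not Spark code) of what `df.select("Name", "Color").distinct()` computes on the example table above, where duplicates of the full (Name, Color) combination are removed:

```python
# The example table as (Name, Color) rows; ("Greg", "Red") appears twice
rows = [("John", "Blue"), ("Greg", "Red"), ("John", "Yellow"),
        ("Greg", "Red"), ("Greg", "Blue")]

# distinct() keeps one copy of each (Name, Color) combination
distinct_rows = sorted(set(rows))
print(distinct_rows)
# [('Greg', 'Blue'), ('Greg', 'Red'), ('John', 'Blue'), ('John', 'Yellow')]
```

The duplicate ("Greg", "Red") row is collapsed to one, leaving four distinct rows.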

How can I get distinct values of all columns in SQL?

To get the unique or distinct values of a column in a SQL table, use the following query: SELECT DISTINCT column_name FROM your_table_name; You can select distinct values for one or more columns.

How do you find unique values in PySpark?

In PySpark, you can use distinct().count() on a DataFrame or the countDistinct() SQL function to get the distinct count. distinct() eliminates duplicate records (rows matching on all columns) from the DataFrame, and count() returns the number of remaining records.
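A plain-Python sketch (not Spark code) of what `countDistinct("Color")` computes over the example table, where de-duplication happens before counting:

```python
rows = [("John", "Blue"), ("Greg", "Red"), ("John", "Yellow"),
        ("Greg", "Red"), ("Greg", "Blue")]

colors = [color for _, color in rows]      # the Color column
distinct_color_count = len(set(colors))    # de-duplicate first, then count
print(distinct_color_count)
# 3  (Blue, Red, Yellow)
```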

How do you select distinct records in PySpark DataFrame?

To select distinct rows on multiple columns, use dropDuplicates(). This function takes the columns on which you want distinct values and returns a new DataFrame with unique values on the selected columns. When called with no arguments, it behaves exactly like distinct().
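A plain-Python sketch (not Spark code) of the semantics of `dropDuplicates(["Name"])` on the example table: one row is kept per Name, the other columns coming from a surviving row for that key (in Spark the surviving row is not guaranteed; here we keep the first for determinism):

```python
rows = [("John", "Blue"), ("Greg", "Red"), ("John", "Yellow"),
        ("Greg", "Red"), ("Greg", "Blue")]

seen = set()
first_per_name = []
for name, color in rows:
    if name not in seen:      # keep only the first row for each Name
        seen.add(name)
        first_per_name.append((name, color))
print(first_per_name)
# [('John', 'Blue'), ('Greg', 'Red')]
```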


1 Answer

collect_list will give you a list without removing duplicates. collect_set automatically removes duplicates, so just:

select
  Name,
  count(distinct Color) as Distinct, -- not a very good column name
  collect_set(Color) as Values
from TblName
group by Name

This function has been available since Spark 1.6.0; see:

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala

/**
 * Aggregate function: returns a set of objects with duplicate elements eliminated.
 *
 * For now this is an alias for the collect_set Hive UDAF.
 *
 * @group agg_funcs
 * @since 1.6.0
 */
def collect_set(columnName: String): Column = collect_set(Column(columnName))
answered Oct 17 '22 by Zahiro Mor