
SQL on Spark: How do I get all values of DISTINCT?

So, assume I have the following table:

Name | Color
------------------------------
John | Blue
Greg | Red
John | Yellow
Greg | Red
Greg | Blue

I would like to get, for each name, the number of distinct colors and their values. Meaning, something like this:

Name | Distinct | Values
--------------------------------------
John |   2      | Blue, Yellow
Greg |   2      | Red, Blue

Any ideas how to do so?

asked Mar 20 '16 by shakedzy

People also ask

How do I select distinct values in spark?

Distinct values of a column in PySpark are obtained by using the select() function together with the distinct() function. select() takes one or more column names as arguments, and the following distinct() call returns the distinct values of those columns combined.
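As a plain-Python sketch (not Spark code) of what `df.select("Name", "Color").distinct()` computes on the example table above, where duplicates of the full (Name, Color) combination are removed:

```python
# The example table as (Name, Color) rows; ("Greg", "Red") appears twice
rows = [("John", "Blue"), ("Greg", "Red"), ("John", "Yellow"),
        ("Greg", "Red"), ("Greg", "Blue")]

# distinct() keeps one copy of each (Name, Color) combination
distinct_rows = sorted(set(rows))
print(distinct_rows)
# [('Greg', 'Blue'), ('Greg', 'Red'), ('John', 'Blue'), ('John', 'Yellow')]
```

The duplicate ("Greg", "Red") row is collapsed to one, leaving four distinct rows.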

How can I get distinct values of all columns in SQL?

To get the unique or distinct values of a column in a SQL table, use the following query: SELECT DISTINCT column_name FROM your_table_name; You can select distinct values for one or more columns.

How do you find unique values in PySpark?

In PySpark, you can use distinct().count() on a DataFrame or the countDistinct() SQL function to get the distinct count. distinct() eliminates duplicate records (rows matching on all columns) from the DataFrame, and count() returns the number of remaining records.
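A plain-Python sketch (not Spark code) of what `countDistinct("Color")` computes over the example table, where de-duplication happens before counting:

```python
rows = [("John", "Blue"), ("Greg", "Red"), ("John", "Yellow"),
        ("Greg", "Red"), ("Greg", "Blue")]

colors = [color for _, color in rows]      # the Color column
distinct_color_count = len(set(colors))    # de-duplicate first, then count
print(distinct_color_count)
# 3  (Blue, Red, Yellow)
```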

How do you select distinct records in PySpark DataFrame?

To select distinct rows on multiple columns, use dropDuplicates(). This function takes the columns on which you want distinct values and returns a new DataFrame with unique values on the selected columns. When called with no arguments, it behaves exactly like distinct().
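A plain-Python sketch (not Spark code) of the semantics of `dropDuplicates(["Name"])` on the example table: one row is kept per Name, the other columns coming from a surviving row for that key (in Spark the surviving row is not guaranteed; here we keep the first for determinism):

```python
rows = [("John", "Blue"), ("Greg", "Red"), ("John", "Yellow"),
        ("Greg", "Red"), ("Greg", "Blue")]

seen = set()
first_per_name = []
for name, color in rows:
    if name not in seen:      # keep only the first row for each Name
        seen.add(name)
        first_per_name.append((name, color))
print(first_per_name)
# [('John', 'Blue'), ('Greg', 'Red')]
```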


1 Answer

collect_list will give you a list without removing duplicates. collect_set automatically removes duplicates, so just:

select
  Name,
  count(distinct Color) as Distinct, -- not a very good column name
  collect_set(Color) as Values
from TblName
group by Name

This function has been available since Spark 1.6.0; see:

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala

/**
 * Aggregate function: returns a set of objects with duplicate elements eliminated.
 *
 * For now this is an alias for the collect_set Hive UDAF.
 *
 * @group agg_funcs
 * @since 1.6.0
 */
def collect_set(columnName: String): Column = collect_set(Column(columnName))
answered Oct 17 '22 by Zahiro Mor