So, assume I have the following table:
Name | Color
------------------------------
John | Blue
Greg | Red
John | Yellow
Greg | Red
Greg | Blue
I would like to get a table of the distinct colors for each name - how many and their values. Meaning, something like this:
Name | Distinct | Values
--------------------------------------
John | 2 | Blue, Yellow
Greg | 2 | Red, Blue
Any ideas how to do so?
Distinct value of the column in pyspark is obtained by using select() function along with distinct() function. select() function takes up mutiple column names as argument, Followed by distinct() function will give distinct value of those columns combined.
MySQL – Distinct Values To get unique or distinct values of a column in MySQL Table, use the following SQL Query. SELECT DISTINCT(column_name) FROM your_table_name; You can select distinct values for one or more columns.
In PySpark, you can use distinct(). count() of DataFrame or countDistinct() SQL function to get the count distinct. distinct() eliminates duplicate records(matching all columns of a Row) from DataFrame, count() returns the count of records on DataFrame.
To select distinct on multiple columns using the dropDuplicates(). This function takes columns where you wanted to select distinct values and returns a new DataFrame with unique values on selected columns. When no argument is used it behaves exactly the same as a distinct() function.
collect_list will give you a list without removing duplicates. collect_set will automatically remove duplicates so just
select
Name,
count(distinct color) as Distinct, # not a very good name
collect_set(Color) as Values
from TblName
group by Name
this feature is implemented since spark 1.6.0 check it out:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala
/**
* Aggregate function: returns a set of objects with duplicate elements eliminated.
*
* For now this is an alias for the collect_set Hive UDAF.
*
* @group agg_funcs
* @since 1.6.0
*/
def collect_set(columnName: String): Column = collect_set(Column(columnName))
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With