
get the distinct elements of an ArrayType column in a spark dataframe

I have a dataframe with 3 columns named id, feat1 and feat2. feat1 and feat2 are arrays of strings:

id, feat1, feat2
------------------
1, ["feat1_1","feat1_2","feat1_3"], []
2, ["feat1_2"], ["feat2_1","feat2_2"]
3, ["feat1_4"], ["feat2_3"]

I want to get the list of distinct elements inside each feature column, so the output will be:

distinct_feat1, distinct_feat2
------------------------------
["feat1_1","feat1_2","feat1_3","feat1_4"], ["feat2_1","feat2_2","feat2_3"]

What is the best way to do this in Scala?

asked Jun 14 '16 by Masoud Tavazoei

People also ask

How do I get unique values of a column in spark DataFrame?

In PySpark, there are two ways to get the count of distinct values. You can use the distinct() and count() functions of DataFrame to get the distinct count. Another way is to use the SQL countDistinct() function, which returns the distinct value count of the selected columns.
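Since the question here is about Scala, a minimal sketch of the same two approaches in the Scala API (assuming a dataframe df with an id column, as in the question):

import org.apache.spark.sql.functions.countDistinct

// Approach 1: distinct() followed by count()
val n1 = df.select("id").distinct().count()

// Approach 2: the countDistinct aggregate function
val n2 = df.agg(countDistinct("id")).first().getLong(0)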

How do you find the unique elements in a column PySpark?

Use PySpark's distinct() to select unique rows across all columns. It returns a new DataFrame containing only distinct rows; any row that duplicates another row across all columns is eliminated from the results.
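The Scala API has the same calls; a short sketch (df assumed as above):

// distinct() deduplicates across all columns;
// dropDuplicates() can restrict the comparison to a subset of columns
val uniqueRows = df.distinct()
val uniqueById = df.dropDuplicates("id")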

How does PySpark define ArrayType?

Create a PySpark ArrayType: you can create an instance of an ArrayType using the ArrayType() class. It takes an elementType and one optional argument, containsNull, which specifies whether elements can be null (True by default). elementType should be a PySpark type that extends the DataType class.
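In Scala the equivalent lives in org.apache.spark.sql.types; a minimal sketch of the array-of-string type used by the question's feature columns:

import org.apache.spark.sql.types.{ArrayType, StringType}

// An array-of-string column type whose elements may be null
val featType = ArrayType(StringType, containsNull = true)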

How do I select specific columns in spark DataFrame?

You can select one or more columns of a Spark DataFrame by passing the column names you want to the select() function. Since a DataFrame is immutable, this creates a new DataFrame with the selected columns. The show() function is used to display the DataFrame contents.
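For example, selecting just the two feature columns from the question's dataframe (sketch):

// select() returns a new DataFrame holding only the named columns
val feats = df.select("feat1", "feat2")
feats.show(truncate = false)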


2 Answers

You can use the collect_set function to find the distinct values of the corresponding column after applying the explode function on each column to unnest the array elements in each cell. Suppose your data frame is called df:

import org.apache.spark.sql.functions._

// explode unnests each array column into one row per element;
// collect_set then aggregates the distinct elements of each column
val distinct_df = df.withColumn("feat1", explode(col("feat1"))).
                     withColumn("feat2", explode(col("feat2"))).
                     agg(collect_set("feat1").alias("distinct_feat1"),
                         collect_set("feat2").alias("distinct_feat2"))

distinct_df.show
+--------------------+--------------------+
|      distinct_feat1|      distinct_feat2|
+--------------------+--------------------+
|[feat1_1, feat1_2...|[, feat2_1, feat2...|
+--------------------+--------------------+


distinct_df.take(1)
res23: Array[org.apache.spark.sql.Row] = Array([WrappedArray(feat1_1, feat1_2, feat1_3, feat1_4),
                                                WrappedArray(, feat2_1, feat2_2, feat2_3)])
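One caveat: explode drops rows whose array is empty, so chaining both explodes on the same dataframe will silently discard a row's feat1 values whenever its feat2 is empty (as for id 1 in the question's sample). A sketch that aggregates each column independently avoids this (df as above):

import org.apache.spark.sql.functions._

// Aggregate each column on its own, so an empty array in one column
// cannot eliminate the row before the other column is collected
val d1 = df.select(explode(col("feat1")).alias("f")).agg(collect_set("f").alias("distinct_feat1"))
val d2 = df.select(explode(col("feat2")).alias("f")).agg(collect_set("f").alias("distinct_feat2"))
val distinct_df = d1.crossJoin(d2)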
answered by Psidom

One more solution, for Spark 2.4+:

.withColumn("distinct", array_distinct(concat($"array_col1", $"array_col2")))

Beware: if one of the columns is null, the result will be null.
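A slightly fuller sketch of that idea, guarding the null case with coalesce (typedLit supplies an empty typed array; column names taken from the question). Note this deduplicates within each row's merged array; to get the question's across-rows output you would still aggregate afterwards:

import org.apache.spark.sql.functions._

// Replace null arrays with empty ones so concat cannot return null,
// then drop duplicates within the merged per-row array
val merged = df.withColumn(
  "distinct",
  array_distinct(concat(
    coalesce(col("feat1"), typedLit(Seq.empty[String])),
    coalesce(col("feat2"), typedLit(Seq.empty[String]))
  ))
)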

answered by Avils