
get the distinct elements of an ArrayType column in a spark dataframe

I have a dataframe with 3 columns named id, feat1 and feat2. feat1 and feat2 are arrays of strings:

id, feat1, feat2
------------------
1, ["feat1_1","feat1_2","feat1_3"], []
2, ["feat1_2"], ["feat2_1","feat2_2"]
3, ["feat1_4"], ["feat2_3"]

I want to get the list of distinct elements inside each feature column, so the output will be:

distinct_feat1, distinct_feat2
------------------------------
["feat1_1","feat1_2","feat1_3","feat1_4"], ["feat2_1","feat2_2","feat2_3"]

What is the best way to do this in Scala?

asked Jun 14 '16 by Masoud Tavazoei

People also ask

How do I get unique values of a column in spark DataFrame?

In PySpark, there are two ways to get the count of distinct values. You can use the distinct() and count() functions of DataFrame to get the distinct count. Another way is to use the SQL countDistinct() function, which returns the distinct value count of the selected columns.
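Since the question here is about Scala, a minimal sketch of the same two approaches in the Scala API (assuming a dataframe df with an id column, as in the question):

import org.apache.spark.sql.functions.countDistinct

// Approach 1: distinct() followed by count()
val n1 = df.select("id").distinct().count()

// Approach 2: the countDistinct aggregate function
val n2 = df.agg(countDistinct("id")).first().getLong(0)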

How do you find the unique elements in a column PySpark?

Use PySpark's distinct() to select unique rows across all columns. It returns a new DataFrame containing only distinct rows; any row that duplicates another row across all columns is eliminated from the results.
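The Scala API has the same calls; a short sketch (df assumed as above):

// distinct() deduplicates across all columns;
// dropDuplicates() can restrict the comparison to a subset of columns
val uniqueRows = df.distinct()
val uniqueById = df.dropDuplicates("id")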

How does PySpark define ArrayType?

Create a PySpark ArrayType: you can create an instance of an ArrayType using the ArrayType() class. It takes an elementType and one optional argument, containsNull, which specifies whether elements can be null (True by default). elementType should be a PySpark type that extends the DataType class.
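In Scala the equivalent lives in org.apache.spark.sql.types; a minimal sketch of the array-of-string type used by the question's feature columns:

import org.apache.spark.sql.types.{ArrayType, StringType}

// An array-of-string column type whose elements may be null
val featType = ArrayType(StringType, containsNull = true)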

How do I select specific columns in spark DataFrame?

You can select one or more columns of a Spark DataFrame by passing the column names you want to the select() function. Since a DataFrame is immutable, this creates a new DataFrame with the selected columns. The show() function is used to display the DataFrame contents.
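For example, selecting just the two feature columns from the question's dataframe (sketch):

// select() returns a new DataFrame holding only the named columns
val feats = df.select("feat1", "feat2")
feats.show(truncate = false)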


2 Answers

You can use the collect_set function to find the distinct values of the corresponding column after applying the explode function on each column to unnest the array elements in each cell. Suppose your data frame is called df:

import org.apache.spark.sql.functions._

// explode unnests each array column into one row per element;
// collect_set then aggregates the distinct elements of each column
val distinct_df = df.withColumn("feat1", explode(col("feat1"))).
                     withColumn("feat2", explode(col("feat2"))).
                     agg(collect_set("feat1").alias("distinct_feat1"),
                         collect_set("feat2").alias("distinct_feat2"))

distinct_df.show
+--------------------+--------------------+
|      distinct_feat1|      distinct_feat2|
+--------------------+--------------------+
|[feat1_1, feat1_2...|[, feat2_1, feat2...|
+--------------------+--------------------+


distinct_df.take(1)
res23: Array[org.apache.spark.sql.Row] = Array([WrappedArray(feat1_1, feat1_2, feat1_3, feat1_4),
                                                WrappedArray(, feat2_1, feat2_2, feat2_3)])
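One caveat: explode drops rows whose array is empty, so chaining both explodes on the same dataframe will silently discard a row's feat1 values whenever its feat2 is empty (as for id 1 in the question's sample). A sketch that aggregates each column independently avoids this (df as above):

import org.apache.spark.sql.functions._

// Aggregate each column on its own, so an empty array in one column
// cannot eliminate the row before the other column is collected
val d1 = df.select(explode(col("feat1")).alias("f")).agg(collect_set("f").alias("distinct_feat1"))
val d2 = df.select(explode(col("feat2")).alias("f")).agg(collect_set("f").alias("distinct_feat2"))
val distinct_df = d1.crossJoin(d2)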
answered by Psidom

One more solution, for Spark 2.4+:

.withColumn("distinct", array_distinct(concat($"array_col1", $"array_col2")))

Beware: if one of the columns is null, the result will be null.
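A slightly fuller sketch of that idea, guarding the null case with coalesce (typedLit supplies an empty typed array; column names taken from the question). Note this deduplicates within each row's merged array; to get the question's across-rows output you would still aggregate afterwards:

import org.apache.spark.sql.functions._

// Replace null arrays with empty ones so concat cannot return null,
// then drop duplicates within the merged per-row array
val merged = df.withColumn(
  "distinct",
  array_distinct(concat(
    coalesce(col("feat1"), typedLit(Seq.empty[String])),
    coalesce(col("feat2"), typedLit(Seq.empty[String]))
  ))
)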

answered by Avils