How to group by common element in array?

I am trying to find a way in Spark to group rows of data that share a common element in an array.

 key                            value
[k1,k2]                         v1
[k2]                            v2
[k3,k2]                         v3
[k4]                            v4

If any element of `key` matches between rows, we have to assign them the same group ID (group by common element).

Result:

key                             value  GroupID
[k1,k2]                           v1    G1
[k2]                              v2    G1
[k3,k2]                           v3    G1 
[k4]                              v4    G2
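Before reaching for a graph library, it helps to see that this is a connected-components problem: rows are linked whenever their key arrays intersect. A minimal plain-Scala union-find sketch (not Spark, just an illustration of the grouping the question asks for):

```scala
import scala.collection.mutable

// Union-find (disjoint set) over key strings, with path compression.
val parent = mutable.Map.empty[String, String]

def find(x: String): String = {
  val p = parent.getOrElseUpdate(x, x)
  if (p == x) x else { val root = find(p); parent(x) = root; root }
}

def union(a: String, b: String): Unit = parent(find(a)) = find(b)

val rows = Seq(
  (Seq("k1", "k2"), "v1"), (Seq("k2"), "v2"),
  (Seq("k3", "k2"), "v3"), (Seq("k4"), "v4")
)

// Link every key in a row to the row's first key.
rows.foreach { case (keys, _) =>
  keys.tail.foreach(k => union(keys.head, k))
}

// Rows whose keys share a root belong to the same group.
val groups = rows.groupBy { case (keys, _) => find(keys.head) }
  .values.map(_.map(_._2).toSet).toSet

println(groups)
```

This prints the two expected groups, `{v1, v2, v3}` and `{v4}`. It only works on a single machine, which is exactly why the answer below moves to a distributed connected-components algorithm.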

Some suggestions have already been made using Spark GraphX, but at this moment the learning curve would be too steep to adopt it for a single feature.

Asked May 11 '17 by Aravind Kumar Anugula


1 Answer

Include graphframes (the latest release supports Spark 2.1, but it should work with 2.2 as well; for newer versions you'll have to build your own with the 2.3 patch), replacing XXX with your Spark version and YYY with your Scala version:

spark.jars.packages  graphframes:graphframes:0.5.0-sparkXXX-s_YYY

Explode the keys to build an edge list (each key–value pair becomes an edge):

import org.apache.spark.sql.functions._

val df = Seq(
   (Seq("k1", "k2"), "v1"), (Seq("k2"), "v2"),
   (Seq("k3", "k2"), "v3"), (Seq("k4"), "v4")
).toDF("key", "value")

val edges = df.select(
  explode($"key") as "src", $"value" as "dst")

Convert to a GraphFrame:

import org.graphframes._

val gf = GraphFrame.fromEdges(edges)

Set checkpoint directory (if not set):

import org.apache.spark.sql.SparkSession

val path: String = ???
val spark: SparkSession = ???
spark.sparkContext.setCheckpointDir(path)

Find connected components:

val components = gf.connectedComponents.setAlgorithm("graphx").run

Join result with input data:

val result = components
  .where($"id".startsWith("v"))
  .toDF("value", "group")
  .join(df, Seq("value"))

Check result:

result.show

// +-----+------------+--------+
// |value|       group|     key|
// +-----+------------+--------+
// |   v3|489626271744|[k3, k2]|
// |   v2|489626271744|    [k2]|
// |   v4|532575944704|    [k4]|
// |   v1|489626271744|[k1, k2]|
// +-----+------------+--------+
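The `group` column holds long component IDs rather than the G1/G2 labels from the question. In Spark this relabelling could be done with `dense_rank` over a window ordered by `group`; as a plain-Scala sketch of the same mapping, using hypothetical rows mirroring the output above:

```scala
// (value, group) pairs mirroring the result table above.
val rows = Seq(
  ("v3", 489626271744L), ("v2", 489626271744L),
  ("v4", 532575944704L), ("v1", 489626271744L)
)

// Dense-rank style mapping: components in ascending ID order get G1, G2, ...
val labels = rows.map(_._2).distinct.sorted.zipWithIndex
  .map { case (id, i) => id -> s"G${i + 1}" }
  .toMap

val labelled = rows.map { case (v, g) => (v, labels(g)) }
```

Here `v1`, `v2`, and `v3` map to `G1` and `v4` to `G2`, matching the expected result. The Spark equivalent would be something like `result.withColumn("GroupID", concat(lit("G"), dense_rank().over(Window.orderBy($"group"))))`, though a window with no partition pulls all rows onto one executor, so it only suits small results.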
Answered Oct 25 '22 by Alper t. Turker