 

GroupByKey with datasets in Spark 2.0 using Java

I have a dataset containing data like the following:

| c1 | c2 |
-----------
| 1  | a  |
| 1  | b  |
| 1  | c  |
| 2  | a  |
| 2  | b  |

...

Now I want the data grouped like the following (c1: key, c2: list of values):

| c1 | c2    |
--------------
| 1  | a,b,c |
| 2  | a,b   |
...

I thought that using groupByKey would be a sufficient solution, but I can't find any example of how to use it.

Can anyone help me find a solution using groupByKey, or any other combination of transformations and actions, to get this output using Datasets rather than RDDs?

asked Sep 08 '16 by Andreas

People also ask

How do I use GroupByKey in spark?

In Spark, groupByKey is a frequently used transformation that shuffles data. It takes a dataset of key-value pairs (K, V) as input, groups the values by key, and produces a dataset of (K, Iterable&lt;V&gt;) pairs as output.
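As a minimal sketch of that behaviour at the RDD level (assuming an existing JavaSparkContext named sc; the data here is made up for illustration):

import java.util.Arrays;
import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

// assumes an existing JavaSparkContext "sc"
JavaPairRDD<Integer, String> pairs = sc.parallelizePairs(Arrays.asList(
        new Tuple2<>(1, "a"), new Tuple2<>(1, "b"), new Tuple2<>(2, "a")));
// values with the same key end up in one Iterable
JavaPairRDD<Integer, Iterable<String>> grouped = pairs.groupByKey();
grouped.collect().forEach(t -> System.out.println(t._1() + " -> " + t._2()));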

How is data shuffled using the GroupByKey function in spark?

When groupByKey() is applied to a dataset of (K, V) pairs, the data is shuffled across the network according to the key K to produce a new RDD of (K, Iterable&lt;V&gt;) pairs. Because every value is moved to the partition of its key, this transformation can transfer a lot of unnecessary data over the network.
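When only an aggregate per key is needed, reduceByKey is commonly preferred because it combines values locally on each partition before the shuffle, so less data crosses the network. A minimal sketch, reusing the hypothetical pairs RDD from the snippet above:

// reduceByKey merges values per key on each partition first,
// so far less data is shuffled than with groupByKey
JavaPairRDD<Integer, String> concatenated =
        pairs.reduceByKey((a, b) -> a + "," + b);
concatenated.collect().forEach(t -> System.out.println(t._1() + " -> " + t._2()));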

What is Dataset in Java spark?

A Dataset is a strongly typed collection of domain-specific objects that can be transformed in parallel using functional or relational operations. Each Dataset also has an untyped view called a DataFrame, which is a Dataset of Row. Operations available on Datasets are divided into transformations and actions.
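As a small illustration (a sketch assuming an existing SparkSession named spark):

import java.util.Arrays;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;

// assumes an existing SparkSession "spark"
Dataset<String> words = spark.createDataset(Arrays.asList("a", "bb", "ccc"), Encoders.STRING());
// map is a (lazy) transformation; count is an action that triggers execution
Dataset<Integer> lengths = words.map((MapFunction<String, Integer>) String::length, Encoders.INT());
System.out.println(lengths.count()); // prints 3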

Can we use Datasets in Pyspark?

A Dataset is a distributed collection of data. The Dataset interface was added in Spark 1.6 and provides the benefits of RDDs (strong typing, the ability to use powerful lambda functions) together with the benefits of Spark SQL's optimized execution engine. The typed Dataset API is available in Scala and Java; Python does not have it, but the DataFrame API offers many of the same benefits.


2 Answers

Here is a Spark 2.0 Java example using a Dataset.

import java.util.ArrayList;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import scala.Tuple2;

public class SparkSample {
    public static void main(String[] args) {
        // SparkSession
        SparkSession spark = SparkSession
                .builder()
                .appName("SparkSample")
                .config("spark.sql.warehouse.dir", "/file:C:/temp")
                .master("local")
                .getOrCreate();
        // input data
        List<Tuple2<Integer, String>> inputList = new ArrayList<>();
        inputList.add(new Tuple2<>(1, "a"));
        inputList.add(new Tuple2<>(1, "b"));
        inputList.add(new Tuple2<>(1, "c"));
        inputList.add(new Tuple2<>(2, "a"));
        inputList.add(new Tuple2<>(2, "b"));
        // dataset with columns c1 and c2
        Dataset<Row> dataSet = spark
                .createDataset(inputList, Encoders.tuple(Encoders.INT(), Encoders.STRING()))
                .toDF("c1", "c2");
        dataSet.show();
        // group by c1 and collect the c2 values into a list
        Dataset<Row> dataSet1 = dataSet
                .groupBy("c1")
                .agg(org.apache.spark.sql.functions.collect_list("c2"))
                .toDF("c1", "c2");
        dataSet1.show();
        // stop
        spark.stop();
    }
}
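Since the question asks specifically about groupByKey, a typed alternative is Dataset.groupByKey followed by mapGroups. The following is only a sketch, reusing the dataSet variable from the example above; it concatenates the grouped values into plain strings instead of an array column:

import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.api.java.function.MapGroupsFunction;

// typed alternative: group the rows by c1 and concatenate the c2 values
Dataset<String> grouped = dataSet
        .groupByKey((MapFunction<Row, Integer>) row -> row.getInt(0), Encoders.INT())
        .mapGroups((MapGroupsFunction<Integer, Row, String>) (key, rows) -> {
            StringBuilder values = new StringBuilder();
            rows.forEachRemaining(r -> values.append(r.getString(1)).append(" "));
            return key + ": " + values.toString().trim();
        }, Encoders.STRING());
grouped.show();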
answered Oct 04 '22 by abaghel

With a DataFrame in Spark 2.0 (Scala):

scala> val data = List((1, "a"), (1, "b"), (1, "c"), (2, "a"), (2, "b")).toDF("c1", "c2")
data: org.apache.spark.sql.DataFrame = [c1: int, c2: string]
scala> data.groupBy("c1").agg(collect_list("c2")).collect.foreach(println)
[1,WrappedArray(a, b, c)]
[2,WrappedArray(a, b)]
answered Oct 04 '22 by J Bentz