How to use Dataset to groupBy

Here is how I currently do it with an RDD:

val test = Seq(("New York", "Jack"),
    ("Los Angeles", "Tom"),
    ("Chicago", "David"),
    ("Houston", "John"),
    ("Detroit", "Michael"),
    ("Chicago", "Andrew"),
    ("Detroit", "Peter"),
    ("Detroit", "George")
  )
sc.parallelize(test).groupByKey().mapValues(_.toList).foreach(println)

The result is:

(New York,List(Jack))
(Detroit,List(Michael, Peter, George))
(Los Angeles,List(Tom))
(Houston,List(John))
(Chicago,List(David, Andrew))

How can I do the same with a Dataset in Spark 2.0?

I know I could do it with a custom function, but that feels complicated. Is there a simpler, built-in way?

asked Jun 07 '17 by monkeysjourney




1 Answer

I would suggest you start by creating a case class:

case class Monkey(city: String, firstName: String)

This case class should be defined outside your main class (at the top level), so that Spark can derive an encoder for it. Then you can simply call the toDS function and use groupBy with the aggregation function collect_list, as below:

import sqlContext.implicits._            // in Spark 2.0 you can also use spark.implicits._ from a SparkSession
import org.apache.spark.sql.functions._  // brings collect_list into scope

val test = Seq(("New York", "Jack"),
  ("Los Angeles", "Tom"),
  ("Chicago", "David"),
  ("Houston", "John"),
  ("Detroit", "Michael"),
  ("Chicago", "Andrew"),
  ("Detroit", "Peter"),
  ("Detroit", "George")
)
sc.parallelize(test)
  .map(row => Monkey(row._1, row._2))
  .toDS()
  .groupBy("city")
  .agg(collect_list("firstName") as "list")
  .show(false)

The output will be:

+-----------+------------------------+
|city       |list                    |
+-----------+------------------------+
|Los Angeles|[Tom]                   |
|Detroit    |[Michael, Peter, George]|
|Chicago    |[David, Andrew]         |
|Houston    |[John]                  |
|New York   |[Jack]                  |
+-----------+------------------------+
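
As a side note, if you would rather stay in the typed Dataset API instead of falling back to untyped DataFrame columns, Spark 2.0 also offers groupByKey with mapGroups. A minimal sketch, assuming the same Monkey case class and imports as above (the variable name grouped is mine):

// groupByKey keeps the typed API end to end;
// mapGroups folds each group into a (city, names) pair
val grouped = sc.parallelize(test)
  .map(row => Monkey(row._1, row._2))
  .toDS()
  .groupByKey(_.city)
  .mapGroups((city, monkeys) => (city, monkeys.map(_.firstName).toList))

grouped.show(false)

The result columns here are named _1 and _2; call .toDF("city", "list") if you want the same headers as above.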

You can always convert back to an RDD by calling the .rdd function.
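
For example, a small sketch reusing the pipeline above (the exact Row rendering, e.g. WrappedArray, may vary by Spark version):

val df = sc.parallelize(test)
  .map(row => Monkey(row._1, row._2))
  .toDS()
  .groupBy("city")
  .agg(collect_list("firstName") as "list")

df.rdd.foreach(println)   // prints Rows like [Detroit,WrappedArray(Michael, Peter, George)]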

answered Nov 12 '22 by Ramesh Maharjan