
Spark SQL replacement for MySQL's GROUP_CONCAT aggregate function

I have a table of two string type columns (username, friend) and for each username, I want to collect all of its friends on one row, concatenated as strings. For example: ('username1', 'friends1, friends2, friends3')

I know MySQL does this with GROUP_CONCAT. Is there any way to do this with Spark SQL?

asked Jul 26 '15 by Zahra I.S

People also ask

What is AGG in Spark SQL?

agg. (Java-specific) Compute aggregates by specifying a map from column name to aggregate methods. The resulting DataFrame will also contain the grouping columns. The available aggregate methods are avg , max , min , sum , count .

What do you need to do to use the AGG function in Spark?

You need to define a key or grouping in aggregation. You can also define an aggregation function that specifies how the transformations will be performed among the columns. If you give multiple values as input, the aggregation function will generate one result for each group.
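For illustration only, a minimal Scala sketch of both forms described above (the DataFrame and column names are made up, and a spark-shell session with implicits in scope is assumed):

import org.apache.spark.sql.functions.{avg, count}

// Hypothetical DataFrame with a grouping key and a numeric column.
val people = Seq(("u1", 23), ("u1", 31), ("u2", 40)).toDF("username", "age")

// Map form: column name -> aggregate method name.
people.groupBy("username").agg(Map("age" -> "max")).show()

// Expression form: one result per group for each aggregate expression.
people.groupBy("username").agg(avg($"age"), count($"age")).show()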

Is concat an aggregate function?

The GROUP_CONCAT() function in MySQL is used to concatenate data from multiple rows into one field. This is an aggregate (GROUP BY) function which returns a String value, if the group contains at least one non-NULL value.

What is Spark aggregation?

Aggregations in Spark are similar to any relational database. Aggregations are a way to group data together to look at it from a higher level, as illustrated in figure 1. Aggregation can be performed on tables, joined tables, views, etc.


2 Answers

Before you proceed: this operation is yet another groupByKey. While it has multiple legitimate applications, it is relatively expensive, so be sure to use it only when required.


Not exactly a concise or efficient solution, but you can use a UserDefinedAggregateFunction, introduced in Spark 1.5.0:

import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types.{ArrayType, StringType, StructType}
import org.apache.spark.unsafe.types.UTF8String
import scala.collection.mutable.ArrayBuffer

object GroupConcat extends UserDefinedAggregateFunction {
    def inputSchema = new StructType().add("x", StringType)
    def bufferSchema = new StructType().add("buff", ArrayType(StringType))
    def dataType = StringType
    def deterministic = true

    // Start each group with an empty buffer of strings.
    def initialize(buffer: MutableAggregationBuffer) = {
      buffer.update(0, ArrayBuffer.empty[String])
    }

    // Append each non-null input value to the buffer.
    def update(buffer: MutableAggregationBuffer, input: Row) = {
      if (!input.isNullAt(0))
        buffer.update(0, buffer.getSeq[String](0) :+ input.getString(0))
    }

    // Combine partial buffers from different partitions.
    def merge(buffer1: MutableAggregationBuffer, buffer2: Row) = {
      buffer1.update(0, buffer1.getSeq[String](0) ++ buffer2.getSeq[String](0))
    }

    // Produce the final comma-separated string.
    def evaluate(buffer: Row) = UTF8String.fromString(
      buffer.getSeq[String](0).mkString(","))
}

Example usage:

val df = sc.parallelize(Seq(
  ("username1", "friend1"),
  ("username1", "friend2"),
  ("username2", "friend1"),
  ("username2", "friend3")
)).toDF("username", "friend")

df.groupBy($"username").agg(GroupConcat($"friend")).show

## +---------+---------------+
## | username|        friends|
## +---------+---------------+
## |username1|friend1,friend2|
## |username2|friend1,friend3|
## +---------+---------------+

You can also create a Python wrapper, as shown in "Spark: How to map Python with Scala or Java User Defined Functions?".

In practice it can be faster to extract the underlying RDD, groupByKey, mkString, and rebuild the DataFrame, as sketched below.
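A minimal sketch of that approach, assuming the same df as above and that the spark-shell implicits are in scope:

// Drop to the RDD API, group friends by username, join them with mkString,
// and rebuild a DataFrame from the result.
val concatenated = df.rdd
  .map(row => (row.getString(0), row.getString(1)))
  .groupByKey()
  .mapValues(_.mkString(","))
  .toDF("username", "friends")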

You can get a similar effect by combining the collect_list function (Spark >= 1.6.0) with concat_ws:

import org.apache.spark.sql.functions.{collect_list, concat_ws}

df.groupBy($"username")
  .agg(concat_ws(",", collect_list($"friend")).alias("friends"))
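If you prefer staying in SQL, the same combination should work in a query as well. A sketch (the temp table name is made up, and on Spark 1.x collect_list may require a HiveContext):

df.registerTempTable("friends_table")
sqlContext.sql(
  "select username, concat_ws(',', collect_list(friend)) as friends " +
  "from friends_table group by username"
).show()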
answered Sep 18 '22 by zero323


You can try the collect_list function:

sqlContext.sql("select A, collect_list(B), collect_list(C) from Table1 group by A 

Or you can register a UDF, something like:

sqlContext.udf.register("myzip", (a: Long, b: Long) => a + "," + b)

and then use this function in the query:

sqlContext.sql("select A, collect_list(myzip(B, C)) from tbl group by A")
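Putting the two pieces together, a rough end-to-end sketch (the table name, column names, and sample data are hypothetical; on Spark 1.x, collect_list may require a HiveContext):

// Hypothetical data: grouping key A with two Long columns B and C.
val tbl = Seq((1L, 10L, 100L), (1L, 20L, 200L), (2L, 30L, 300L)).toDF("A", "B", "C")
tbl.registerTempTable("tbl")

// Register a UDF that zips B and C into a single comma-separated string.
sqlContext.udf.register("myzip", (a: Long, b: Long) => a + "," + b)

// Collect the zipped pairs per key.
sqlContext.sql("select A, collect_list(myzip(B, C)) from tbl group by A").show()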
answered Sep 18 '22 by iec2011007