Use collect_list and collect_set in Spark SQL

According to the docs, the collect_set and collect_list functions should be available in Spark SQL. However, I cannot get it to work. I'm running Spark 1.6.0 using a Docker image.

I'm trying to do this in Scala:

import org.apache.spark.sql.functions._ 

df.groupBy("column1") 
  .agg(collect_set("column2")) 
  .show() 

I receive the following error at runtime:

Exception in thread "main" org.apache.spark.sql.AnalysisException: undefined function collect_set; 

I also tried it using PySpark, but it fails there too. The docs state these functions are aliases of Hive UDAFs, but I can't figure out how to enable them.

How can I fix this? Thanks!

asked Feb 20 '16 by JFX

People also ask

What does collect_list do in Spark?

The Spark function collect_list() aggregates column values into an ArrayType column, typically after a group-by or within a window partition.

What is Collect_set?

In Apache Hive, COLLECT_SET is an aggregate function that collects the unique values from multiple rows into an array.

Does collect_list maintain order?

Not on its own: collect_list() makes no ordering guarantee. The resulting order is only predictable when the input is explicitly sorted before the aggregation, for example within a window ordered by a timestamp.
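
For instance, a common way to get a deterministic order is to collect over a sorted window (a sketch for Spark 2.x; the spark session, column names, and sample data are assumptions):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._  // assumes an existing SparkSession named spark

val events = Seq(("a", 1, "x"), ("a", 2, "y"), ("a", 3, "z"))
  .toDF("id", "ts", "event")

// With an ordered window, the default frame runs from the start of the
// partition to the current row, so each row accumulates events in ts order.
val byTime = Window.partitionBy("id").orderBy("ts")

events.withColumn("so_far", collect_list("event").over(byTime)).show()
// the last row per id holds the fully ordered list: [x, y, z]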

What does collect set do in PySpark?

The collect_set() function returns all values from the input column with duplicates eliminated, while collect_list() returns all values including duplicates.
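
The same pair exists in Scala; a minimal sketch of the difference (the spark session and sample data are assumptions):

import org.apache.spark.sql.functions._
import spark.implicits._  // assumes an existing SparkSession named spark

val df = Seq(("a", 1), ("a", 1), ("a", 2)).toDF("k", "v")

df.groupBy("k")
  .agg(collect_set("v").as("uniq"), collect_list("v").as("all"))
  .show(truncate = false)
// uniq drops the duplicate 1, e.g. [1, 2] (element order not guaranteed);
// all keeps it, e.g. [1, 1, 2]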


1 Answer

Spark 2.0+:

SPARK-10605 introduced a native collect_list and collect_set implementation. A SparkSession with Hive support or a HiveContext is no longer required.
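
On Spark 2.x the query from the question therefore runs against a plain SparkSession (a minimal sketch; the column names mirror the question):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("collect-example")
  .getOrCreate()  // note: no enableHiveSupport() needed

import spark.implicits._

val df = Seq(("a", 1), ("a", 2), ("b", 3)).toDF("column1", "column2")

df.groupBy("column1")
  .agg(collect_set("column2"))
  .show()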

Spark 2.0-SNAPSHOT (before 2016-05-03):

You have to enable Hive support for a given SparkSession:

In Scala:

val spark = SparkSession.builder
  .master("local")
  .appName("testing")
  .enableHiveSupport()  // <- enable Hive support.
  .getOrCreate()

In Python:

spark = (SparkSession.builder
    .enableHiveSupport()
    .getOrCreate())

Spark < 2.0:

To be able to use Hive UDFs (see https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF) you have to use a Spark build with Hive support (this is already the case for the pre-built binaries, which seems to apply here) and initialize your SQLContext as a HiveContext.

In Scala:

import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.SQLContext

val sqlContext: SQLContext = new HiveContext(sc) 

In Python:

from pyspark.sql import HiveContext

sqlContext = HiveContext(sc)
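
With the HiveContext in place, the aggregation from the question should resolve (a Scala sketch, assuming the existing SparkContext sc and the question's column names):

import org.apache.spark.sql.functions._
import sqlContext.implicits._

val df = sc.parallelize(Seq(("a", 1), ("a", 2), ("a", 1)))
  .toDF("column1", "column2")

df.groupBy("column1")
  .agg(collect_set("column2"))  // now resolves via the Hive UDAF
  .show()
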
answered Sep 17 '22 by zero323