SparkSQL: Can I explode two different variables in the same query?

Tags:

spark-dataframe

I have the following explode query, which works fine:

data1 = sqlContext.sql("select explode(names) as name from data")

I also want to explode another field, "colors", so that the final output is the Cartesian product of names and colors. So I tried:

data1 = sqlContext.sql("select explode(names) as name, explode(colors) as color from data")

But I got the error:

 Only one generator allowed per select but Generate and and Explode found.;

Does anyone have any idea?

I can actually make it work in two steps:

   data1 = sqlContext.sql("select explode(names) as name from data")
   data1.registerTempTable('data1')
   data1 = sqlContext.sql("select explode(colors) as color from data1")

But I am wondering if it is possible to do it in one step? Thanks a lot!

asked Apr 26 '16 22:04

3 Answers

The correct syntax is

select name, color 
from data 
lateral view explode(names) exploded_names as name 
lateral view explode(colors) exploded_colors as color

The reason why Rashid's answer did not work is that it did not "name" the table generated by LATERAL VIEW.
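For the Python setup in the question, the query above could be run in one step roughly like this. This is a sketch, assuming the `data` table is already registered as in the question and that the SQL dialect in use accepts LATERAL VIEW; the names `QUERY` and `run_one_step` are made up for illustration.

```python
# The named LATERAL VIEW query from this answer, kept as a single SQL string.
QUERY = """
select name, color
from data
lateral view explode(names) exploded_names as name
lateral view explode(colors) exploded_colors as color
"""


def run_one_step(sql_context):
    """Run the query on the given context.

    `sql_context` is assumed to be the question's sqlContext (anything with a
    .sql(str) method); the table `data` must already be registered on it.
    """
    return sql_context.sql(QUERY)
```

With the question's objects this would be called as `data1 = run_one_step(sqlContext)`.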

Explanation

Think of it this way: LATERAL VIEW works like an implicit JOIN with an ephemeral table that is created for every row from the elements of the collection being "viewed". So, the way to parse the syntax is:

LATERAL VIEW table_generation_function(collection_column) table_name AS col1, ...
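To make the implicit-join reading concrete, here is a plain-Python model of what two chained LATERAL VIEW explode clauses compute. It only sketches the semantics and says nothing about how Spark executes the query; `lateral_view_explode` and the sample row are hypothetical.

```python
from itertools import product


def lateral_view_explode(rows, *columns):
    """For every input row, join the row with the elements of each listed
    array column, i.e. emit one tuple per combination of elements."""
    out = []
    for row in rows:
        # product() over the row's own arrays is the per-row cartesian product.
        for combo in product(*(row[c] for c in columns)):
            out.append(combo)
    return out


data = [{"names": ["Alice", "Bob"], "colors": ["Red", "Green", "Blue"]}]
result = lateral_view_explode(data, "names", "colors")
for name, color in result:
    print(name, color)
```

In this model a row whose array is empty contributes no output rows, which is also how a plain explode behaves.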

Multiple output columns

If you use a table generating function such as posexplode() then you still have one output table but with multiple output columns:

LATERAL VIEW posexplode(orders) exploded_orders AS order_number, order
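As a rough mental model, posexplode pairs each element with its zero-based position, much like Python's enumerate. The helper below is a plain-Python illustration with made-up sample data, not Spark code.

```python
def posexplode(values):
    """Return (position, element) pairs, with positions starting at 0."""
    return list(enumerate(values))


orders = ["book", "pen"]  # hypothetical contents of one row's `orders` array
pairs = posexplode(orders)
print(pairs)  # [(0, 'book'), (1, 'pen')]
```

Each pair corresponds to one output row with the two columns order_number and order.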

Nesting

You can also "nest" LATERAL VIEW by repeatedly exploding nested collections, e.g.,

LATERAL VIEW posexplode(orders) exploded_orders AS order_number, order
LATERAL VIEW posexplode(order.items) exploded_items AS item_number, item
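In plain Python, the nested form reads like two nested loops, where the inner loop can see the variables bound by the outer one. This is only a sketch of the semantics; the row layout (an `orders` list of dicts, each with an `items` list) is an assumption made for the example.

```python
def explode_orders_and_items(rows):
    """Emit one (order_number, item_number, item) tuple per item of every order."""
    out = []
    for row in rows:
        # First LATERAL VIEW: one "row" per order, with its position.
        for order_number, order in enumerate(row["orders"]):
            # Second LATERAL VIEW: explodes a field of the order bound above.
            for item_number, item in enumerate(order["items"]):
                out.append((order_number, item_number, item))
    return out


rows = [{"orders": [{"items": ["pen", "ink"]}, {"items": ["book"]}]}]
exploded = explode_orders_and_items(rows)
print(exploded)  # [(0, 0, 'pen'), (0, 1, 'ink'), (1, 0, 'book')]
```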

Performance considerations

While we are on the topic of LATERAL VIEW it is important to note that using it via SparkSQL is more efficient than using it via the DataFrame DSL, e.g., myDF.explode(). The reason is that SQL can reason accurately about the schema while the DSL API has to perform type conversion between a language type and the dataframe row. What the DSL API loses in terms of performance, however, it gains in flexibility as you can return any supported type from explode, which means that you can perform a more complicated transformation in one step.

Update

In recent versions of Spark, row-level explode via df.explode() has been deprecated in favor of column-level explode via df.select(..., explode(...).as(...)). There is also an explode_outer(), which will produce output rows even if the input to be exploded is null. Column-level exploding does not suffer from the performance issues of row-level exploding mentioned above as Spark can perform the transformation entirely using internal row data representations.
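The difference between explode and explode_outer can be modelled in plain Python as follows. This is a sketch of the row-level behaviour only, with hypothetical helper names and sample rows; it is not the Spark API.

```python
def explode(rows, column):
    """One output row per array element; rows with a None or empty array vanish."""
    return [(row["id"], value) for row in rows for value in (row[column] or [])]


def explode_outer(rows, column):
    """Like explode, but a None or empty array still yields one row with None."""
    return [(row["id"], value) for row in rows for value in (row[column] or [None])]


rows = [
    {"id": 1, "colors": ["Red", "Blue"]},
    {"id": 2, "colors": None},
    {"id": 3, "colors": []},
]
inner = explode(rows, "colors")
outer = explode_outer(rows, "colors")
print(inner)  # [(1, 'Red'), (1, 'Blue')]
print(outer)  # [(1, 'Red'), (1, 'Blue'), (2, None), (3, None)]
```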

answered Jan 11 '23 17:01

Sim

Try lateral view explode instead.

select name, color from data lateral view explode(names) as name lateral view explode(colors) as color;

answered Jan 11 '23 17:01

Rashid Ali

There's a simple way to explode multiple columns: chain df.withColumn calls, one per column.

scala> val data = spark.sparkContext.parallelize(Seq((Array("Alice", "Bob"), Array("Red", "Green", "Blue"))))
  .toDF("names", "colors")
data: org.apache.spark.sql.DataFrame = [names: array<string>, colors: array<string>]

scala> data.show
+------------+------------------+
|       names|            colors|
+------------+------------------+
|[Alice, Bob]|[Red, Green, Blue]|
+------------+------------------+

scala> data.withColumn("name", explode('names))
  .withColumn("color", explode('colors))
  .show

+------------+------------------+-----+-----+
|       names|            colors| name|color|
+------------+------------------+-----+-----+
|[Alice, Bob]|[Red, Green, Blue]|Alice|  Red|
|[Alice, Bob]|[Red, Green, Blue]|Alice|Green|
|[Alice, Bob]|[Red, Green, Blue]|Alice| Blue|
|[Alice, Bob]|[Red, Green, Blue]|  Bob|  Red|
|[Alice, Bob]|[Red, Green, Blue]|  Bob|Green|
|[Alice, Bob]|[Red, Green, Blue]|  Bob| Blue|
+------------+------------------+-----+-----+

answered Jan 11 '23 18:01

Todd Leo
