I have the following explode query, which works fine:
data1 = sqlContext.sql("select explode(names) as name from data")
I want to explode another field "colors", so the final output could be the cartesian product of names and colors. So I did:
data1 = sqlContext.sql("select explode(names) as name, explode(colors) as color from data")
But I got the errors:
Only one generator allowed per select but Generate and and Explode found.;
Does any one have any idea?
I can actually make it work by doing two steps:
data1 = sqlContext.sql("select explode(names) as name from data")
data1.registerTempTable('data1')
data1 = sqlContext.sql("select explode(colors) as color from data1")
But I am wondering if it is possible to do it in one step? Thanks a lot!
Spark SQL explode function is used to create or split an array or map DataFrame columns to rows. Spark defines several flavors of this function; explode_outer – to handle nulls and empty, posexplode – which explodes with a position of element and posexplode_outer – to handle nulls.
If EXPLODE is applied on an instance of SQL. ARRAY <T>, the resulting rowset contains a single column of type T where each item in the array is placed into its own row. If the array value was empty or null, then the resulting rowset is empty. If EXPLODE is applied on an instance of SQL.
Lateral view explodes the array data into multiple rows. In other words, lateral view expands the array into rows. When you use a lateral view along with the explode function, you will get the result something like below.
The correct syntax is
select name, color
from data
lateral view explode(names) exploded_names as name
lateral view explode(colors) exploded_colors as color
The reason why Rashid's answer did not work is that it did not "name" the table generated by LATERAL VIEW
.
Think of it this way: LATERAL VIEW
works like an implicit JOIN
with with an ephemeral table created for every row from the structs
in the collection being "viewed". So, the way to parse the syntax is:
LATERAL VIEW table_generation_function(collection_column) table_name AS col1, ...
If you use a table generating function such as posexplode()
then you still have one output table but with multiple output columns:
LATERAL VIEW posexplode(orders) exploded_orders AS order_number, order
You can also "nest" LATERAL VIEW
by repeatedly exploding nested collections, e.g.,
LATERAL VIEW posexplode(orders) exploded_orders AS order_number, order
LATERAL VIEW posexplode(order.items) exploded_items AS item_number, item
While we are on the topic of LATERAL VIEW
it is important to note that using it via SparkSQL is more efficient than using it via the DataFrame
DSL, e.g., myDF.explode()
. The reason is that SQL can reason accurately about the schema while the DSL API has to perform type conversion between a language type and the dataframe row. What the DSL API loses in terms of performance, however, it gains in flexibility as you can return any supported type from explode
, which means that you can perform a more complicated transformation in one step.
In recent versions of Spark, row-level explode via df.explode()
has been deprecated in favor of column-level explode via df.select(..., explode(...).as(...))
. There is also an explode_outer()
, which will produce output rows even if the input to be exploded is null
. Column-level exploding does not suffer from the performance issues of row-level exploding mentioned above as Spark can perform the transformation entirely using internal row data representations.
Try lateral view explode instead.
select name, color from data lateral view explode(names) as name lateral view explode(colors) as color;
There's a simple way to do explode on multiple columns by df.withColumn
.
scala> val data = spark.sparkContext.parallelize(Seq((Array("Alice", "Bob"), Array("Red", "Green", "Blue"))))
.toDF("names", "colors")
data: org.apache.spark.sql.DataFrame = [names: array<string>, colors: array<string>]
scala> data.show
+------------+------------------+
| names| colors|
+------------+------------------+
|[Alice, Bob]|[Red, Green, Blue]|
+------------+------------------+
scala> data.withColumn("name", explode('names))
.withColumn("color", explode('colors))
.show
+------------+------------------+-----+-----+
| names| colors| name|color|
+------------+------------------+-----+-----+
|[Alice, Bob]|[Red, Green, Blue]|Alice| Red|
|[Alice, Bob]|[Red, Green, Blue]|Alice|Green|
|[Alice, Bob]|[Red, Green, Blue]|Alice| Blue|
|[Alice, Bob]|[Red, Green, Blue]| Bob| Red|
|[Alice, Bob]|[Red, Green, Blue]| Bob|Green|
|[Alice, Bob]|[Red, Green, Blue]| Bob| Blue|
+------------+------------------+-----+-----+
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With