In Spark, for the following use case, I'd like to understand what are the main differences between using the INLINE and EXPLODE ... I'm not sure if there are any performance implications or if one method is preferred over the other one or if there are any other uses cases where one is appropriate and the other is not...
The use case is to select 2 fields from a complex data type (array of structs), my instinct was to use INLINE since it explodes an array of structs
For example:
WITH sample AS (
SELECT 1 AS id,
array(NAMED_STRUCT('name', 'frank',
'age', 40,
'state', 'Texas'
),
NAMED_STRUCT('name', 'maria',
'age', 51,
'state', 'Georgia'
)
)
AS array_of_structs
),
inline_data AS (
SELECT id,
INLINE(array_of_structs)
FROM sample
)
SELECT id,
name AS person_name,
age AS person_age
FROM inline_data
And using LATERAL VIEW EXPLODE:
WITH sample AS (
SELECT 1 AS id,
array(NAMED_STRUCT('name', 'frank',
'age', 40,
'state', 'Texas'
),
NAMED_STRUCT('name', 'maria',
'age', 51,
'state', 'Georgia'
)
)
AS array_of_structs
)
SELECT id,
person.name,
person.age
FROM sample
LATERAL VIEW EXPLODE(array_of_structs) exploded_people as person
The documentation clearly states what each one of these do but I'd like to better understand when to pick one over the other one.
Lateral view explodes the array data into multiple rows. In other words, lateral view expands the array into rows. When you use a lateral view along with the explode function, you will get the result something like below.
The LATERAL VIEW clause is used in conjunction with generator functions such as EXPLODE , which will generate a virtual table containing one or more rows. LATERAL VIEW will apply the rows to each original output row.
explode (col)[source] Returns a new row for each element in the given array or map. Uses the default column name col for elements in the array and key and value for elements in the map unless specified otherwise.
EXPLODE
UDTF will generate rows of struct (single column of type struct), and to get person name you need to use person.name
:
WITH sample AS (
SELECT 1 AS id,
array(NAMED_STRUCT('name', 'frank',
'age', 40,
'state', 'Texas'
),
NAMED_STRUCT('name', 'maria',
'age', 51,
'state', 'Georgia'
)
)
AS array_of_structs
)
SELECT id,
person.name,
person.age
FROM sample
LATERAL VIEW explode(array_of_structs) exploded_people as person
Result:
id,name,age
1,frank,40
1,maria,51
And INLINE
UDTF will generate a row-set with N columns (N = number of top level elements in the struct), so you do not need to use dot notation person.name
because name
and other struct elements are already extracted by INLINE
:
WITH sample AS (
SELECT 1 AS id,
array(NAMED_STRUCT('name', 'frank',
'age', 40,
'state', 'Texas'
),
NAMED_STRUCT('name', 'maria',
'age', 51,
'state', 'Georgia'
)
)
AS array_of_structs
)
SELECT id,
name,
age
FROM sample
LATERAL VIEW inline(array_of_structs) exploded_people as name, age, state
Result:
id,name,age
1,frank,40
1,maria,51
Both INLINE
and EXPLODE
are UDTFs and require LATERAL VIEW
in Hive. In Spark it works fine without lateral view. The only difference is that EXPLODE
returns dataset of array elements(struct in your case) and INLINE
is used to get struct elements already extracted. You need to define all struct elements in case of INLINE like this: LATERAL VIEW inline(array_of_structs) exploded_people as name, age, state
From performance perspective both INLINE and EXPLODE work the same, you can use EXPLAIN command to check the plan. Extraction of struct elements in the UDTF or after UDTF does not affect performance.
INLINE requires to describe all struct elements (in Hive) and EXPLODE does not, so, explode may be more convenient if you do not need to extract all struct elements of if you do not need to extract elements at all. INLINE is convenient when you need to extract all or most of struct elements.
Your first code example works only in Spark. In Hive 2.1.1 it throws an exception because lateral view
required.
In Spark this will work also:
inline_data AS (
SELECT id,
EXPLODE(array_of_structs) as person
FROM sample
)
And to get age column you need to use person.age
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With