Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Spark INLINE Vs. LATERAL VIEW EXPLODE differences?

In Spark, for the following use case, I'd like to understand what are the main differences between using the INLINE and EXPLODE ... I'm not sure if there are any performance implications or if one method is preferred over the other one or if there are any other uses cases where one is appropriate and the other is not...

The use case is to select 2 fields from a complex data type (array of structs), my instinct was to use INLINE since it explodes an array of structs

For example:

WITH sample AS (
 SELECT 1 AS id,
        array(NAMED_STRUCT('name', 'frank',
                           'age', 40,
                           'state', 'Texas'
                           ),
              NAMED_STRUCT('name', 'maria',
                           'age', 51,
                           'state', 'Georgia'
                           )
              )            
            AS array_of_structs
),

inline_data AS (
SELECT id,
        INLINE(array_of_structs)
FROM sample
)

SELECT id,
        name AS person_name,
        age AS person_age
FROM inline_data

And using LATERAL VIEW EXPLODE:

WITH sample AS (
 SELECT 1 AS id,
        array(NAMED_STRUCT('name', 'frank',
                           'age', 40,
                           'state', 'Texas'
                           ),
              NAMED_STRUCT('name', 'maria',
                           'age', 51,
                           'state', 'Georgia'
                           )
              )            
            AS array_of_structs
)

SELECT  id,
        person.name,
        person.age
FROM sample
LATERAL VIEW EXPLODE(array_of_structs) exploded_people as person 

The documentation clearly states what each one of these do but I'd like to better understand when to pick one over the other one.

like image 894
dim_user Avatar asked May 27 '20 22:05

dim_user


People also ask

What does lateral view explode do?

Lateral view explodes the array data into multiple rows. In other words, lateral view expands the array into rows. When you use a lateral view along with the explode function, you will get the result something like below.

What is lateral view in Spark?

The LATERAL VIEW clause is used in conjunction with generator functions such as EXPLODE , which will generate a virtual table containing one or more rows. LATERAL VIEW will apply the rows to each original output row.

What does explode function do in Spark?

explode (col)[source] Returns a new row for each element in the given array or map. Uses the default column name col for elements in the array and key and value for elements in the map unless specified otherwise.


1 Answers

EXPLODE UDTF will generate rows of struct (single column of type struct), and to get person name you need to use person.name:

WITH sample AS (
 SELECT 1 AS id,
        array(NAMED_STRUCT('name', 'frank',
                           'age', 40,
                           'state', 'Texas'
                           ),
              NAMED_STRUCT('name', 'maria',
                           'age', 51,
                           'state', 'Georgia'
                           )
              )            
            AS array_of_structs
)

SELECT  id,
        person.name,
        person.age
FROM sample
LATERAL VIEW explode(array_of_structs) exploded_people as person

Result:

id,name,age
1,frank,40
1,maria,51

And INLINE UDTF will generate a row-set with N columns (N = number of top level elements in the struct), so you do not need to use dot notation person.name because name and other struct elements are already extracted by INLINE:

WITH sample AS (
 SELECT 1 AS id,
        array(NAMED_STRUCT('name', 'frank',
                           'age', 40,
                           'state', 'Texas'
                           ),
              NAMED_STRUCT('name', 'maria',
                           'age', 51,
                           'state', 'Georgia'
                           )
              )            
            AS array_of_structs
)

SELECT  id,
        name,
        age
FROM sample
LATERAL VIEW inline(array_of_structs) exploded_people as name, age, state

Result:

id,name,age
1,frank,40
1,maria,51 

Both INLINE and EXPLODE are UDTFs and require LATERAL VIEW in Hive. In Spark it works fine without lateral view. The only difference is that EXPLODE returns dataset of array elements(struct in your case) and INLINE is used to get struct elements already extracted. You need to define all struct elements in case of INLINE like this: LATERAL VIEW inline(array_of_structs) exploded_people as name, age, state

From performance perspective both INLINE and EXPLODE work the same, you can use EXPLAIN command to check the plan. Extraction of struct elements in the UDTF or after UDTF does not affect performance.

INLINE requires to describe all struct elements (in Hive) and EXPLODE does not, so, explode may be more convenient if you do not need to extract all struct elements of if you do not need to extract elements at all. INLINE is convenient when you need to extract all or most of struct elements.

Your first code example works only in Spark. In Hive 2.1.1 it throws an exception because lateral view required.

In Spark this will work also:

inline_data AS (
SELECT id,
        EXPLODE(array_of_structs) as person
FROM sample
)

And to get age column you need to use person.age

like image 116
leftjoin Avatar answered Nov 14 '22 22:11

leftjoin