Spark INLINE Vs. LATERAL VIEW EXPLODE differences?

Tags: arrays, sql, apache-spark, explode, hiveql

In Spark, for the following use case, I'd like to understand the main differences between using INLINE and EXPLODE. I'm not sure whether there are any performance implications, whether one method is preferred over the other, or whether there are other use cases where one is appropriate and the other is not.

The use case is to select two fields from a complex data type (an array of structs). My instinct was to use INLINE, since it explodes an array of structs.

For example:

WITH sample AS (
 SELECT 1 AS id,
        array(NAMED_STRUCT('name', 'frank',
                           'age', 40,
                           'state', 'Texas'
                           ),
              NAMED_STRUCT('name', 'maria',
                           'age', 51,
                           'state', 'Georgia'
                           )
              )            
            AS array_of_structs
),

inline_data AS (
SELECT id,
        INLINE(array_of_structs)
FROM sample
)

SELECT id,
        name AS person_name,
        age AS person_age
FROM inline_data

And using LATERAL VIEW EXPLODE:

WITH sample AS (
 SELECT 1 AS id,
        array(NAMED_STRUCT('name', 'frank',
                           'age', 40,
                           'state', 'Texas'
                           ),
              NAMED_STRUCT('name', 'maria',
                           'age', 51,
                           'state', 'Georgia'
                           )
              )            
            AS array_of_structs
)

SELECT  id,
        person.name,
        person.age
FROM sample
LATERAL VIEW EXPLODE(array_of_structs) exploded_people as person

The documentation clearly states what each of these does, but I'd like to better understand when to pick one over the other.

asked May 27 '20 22:05

dim_user

1 Answer

The EXPLODE UDTF generates one row per array element, as a single column of type struct, so to get the person's name you need to use person.name:

WITH sample AS (
 SELECT 1 AS id,
        array(NAMED_STRUCT('name', 'frank',
                           'age', 40,
                           'state', 'Texas'
                           ),
              NAMED_STRUCT('name', 'maria',
                           'age', 51,
                           'state', 'Georgia'
                           )
              )            
            AS array_of_structs
)

SELECT  id,
        person.name,
        person.age
FROM sample
LATERAL VIEW explode(array_of_structs) exploded_people as person

Result:

id,name,age
1,frank,40
1,maria,51

The INLINE UDTF generates a row set with N columns (N = number of top-level fields in the struct), so you do not need dot notation such as person.name, because name and the other struct fields are already extracted by INLINE:

WITH sample AS (
 SELECT 1 AS id,
        array(NAMED_STRUCT('name', 'frank',
                           'age', 40,
                           'state', 'Texas'
                           ),
              NAMED_STRUCT('name', 'maria',
                           'age', 51,
                           'state', 'Georgia'
                           )
              )            
            AS array_of_structs
)

SELECT  id,
        name,
        age
FROM sample
LATERAL VIEW inline(array_of_structs) exploded_people as name, age, state

Result:

id,name,age
1,frank,40
1,maria,51

Both INLINE and EXPLODE are UDTFs and require LATERAL VIEW in Hive. In Spark they also work without LATERAL VIEW. The only difference is that EXPLODE returns a dataset of array elements (structs in your case), whereas INLINE returns the struct fields already extracted into columns. With INLINE in Hive you need to name all of the struct fields, like this: LATERAL VIEW inline(array_of_structs) exploded_people as name, age, state

From a performance perspective, INLINE and EXPLODE work the same; you can use the EXPLAIN command to check the plan. Extracting the struct fields inside the UDTF or after it does not affect performance.
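One way to check this yourself is to compare the two plans. The following is a minimal sketch for Spark SQL, run one statement at a time. The temporary view name sample_people is made up for illustration, and the plan text depends on the Spark version, so no output is shown:

```sql
-- Hypothetical temp view holding the same data as the sample CTE above
CREATE OR REPLACE TEMPORARY VIEW sample_people AS
SELECT 1 AS id,
       array(named_struct('name', 'frank', 'age', 40, 'state', 'Texas'),
             named_struct('name', 'maria', 'age', 51, 'state', 'Georgia')
            ) AS array_of_structs;

-- Plan for EXPLODE followed by field access
EXPLAIN
SELECT id, person.name, person.age
FROM sample_people
LATERAL VIEW explode(array_of_structs) exploded_people AS person;

-- Plan for INLINE with all struct fields named
EXPLAIN
SELECT id, name, age
FROM sample_people
LATERAL VIEW inline(array_of_structs) exploded_people AS name, age, state;
```

I would expect both plans to show a single generator step over the array, with the field selection as an ordinary projection, but it is worth confirming on your own version and data.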

INLINE requires you to list all struct fields (in Hive) and EXPLODE does not, so EXPLODE may be more convenient if you do not need to extract all of the struct fields, or if you do not need to extract fields at all. INLINE is convenient when you need all or most of the struct fields.

Your first code example works only in Spark. In Hive 2.1.1 it throws an exception because LATERAL VIEW is required.

In Spark this will also work:
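As a rough illustration of that trade-off, here is a sketch on the same sample data where EXPLODE is the more natural fit: the struct is passed through whole and only one field is pulled out, so nothing else has to be named. The output column aliases are my own choice:

```sql
WITH sample AS (
 SELECT 1 AS id,
        array(NAMED_STRUCT('name', 'frank', 'age', 40, 'state', 'Texas'),
              NAMED_STRUCT('name', 'maria', 'age', 51, 'state', 'Georgia')
             ) AS array_of_structs
)

-- EXPLODE keeps each array element as one struct column,
-- so you can keep it intact and extract only the fields you need
SELECT  id,
        person AS person_struct,
        person.state AS person_state
FROM sample
LATERAL VIEW explode(array_of_structs) exploded_people AS person
```

With INLINE the whole struct is no longer available as a single column, so this shape of query is easier to write with EXPLODE.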

inline_data AS (
SELECT id,
        EXPLODE(array_of_structs) as person
FROM sample
)

To get the age column you then need to use person.age.
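For completeness, the fragment above can be put together with the sample CTE from the question into a full query. This is a sketch for Spark only (not Hive), and the CTE name exploded_data is my own:

```sql
WITH sample AS (
 SELECT 1 AS id,
        array(NAMED_STRUCT('name', 'frank', 'age', 40, 'state', 'Texas'),
              NAMED_STRUCT('name', 'maria', 'age', 51, 'state', 'Georgia')
             ) AS array_of_structs
),

-- EXPLODE used directly in the select list, without LATERAL VIEW
exploded_data AS (
 SELECT id,
        EXPLODE(array_of_structs) AS person
 FROM sample
)

SELECT id,
       person.name AS person_name,
       person.age AS person_age
FROM exploded_data
```

It should return the same two rows (frank, 40 and maria, 51) as the LATERAL VIEW versions.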

answered Nov 14 '22 22:11

leftjoin
