SQL array flattening: Why doesn't CROSS JOIN UNNEST join every nested value with every row?

Tags:

This question isn't about solving a particular problem, it's about understanding what's actually happening behind the scenes in a common SQL idiom used to flatten arrays. There's some magic behind the scenes and I want to peek behind the curtain of syntactic sugar and see what's going on.

Let's consider the following table t1 :

Now let's assume we have a function called FLATTEN that takes a column of type array and unpacks each of the arrays in that column so that we're left with one row for each value in each array -- if we run SELECT FLATTEN(numbers_array) AS flattened_numbers FROM t1, we'd expect the following, which we'll call t2

In SQL, the CROSS JOIN combines rows from two tables by combining each row from the first table with each row from the second table. So if we run SELECT id, flattened.flattened_numbers from t1 CROSS JOIN flattened, we get

enter image description here

Now flatten is just an imaginary function, and as you can see it's not very useful to combine it with a CROSS JOIN, because each of the original values of the id column gets mixed with flattened_numbers from each of the original rows. Everything gets mixed up because we don't have a WHERE clause that selects only the rows of the CROSS JOIN that we want.

The pattern people actually use to flatten arrays looks like this: SELECT id, flattened_numbers FROM t1 CROSS JOIN UNNEST(sequences.some_numbers) AS flattened_numbers, which produces

enter image description here

But I don't understand why the CROSS JOIN UNNEST pattern actually works. Because the CROSS JOIN doesn't include a WHERE clause, I would expect it behave just like the FLATTEN function I outlined above, where every unnested value gets combined with every row from t1.

Can someone 'unpack' what's actually going on in the CROSS JOIN UNNEST pattern that ensures that each row is only joined with it's own nested values (and not with the nested values from other rows)?

357

asked Dec 06 '17 09:12

conradlee

1 Answers

The best way to think about this is by looking at what happens on a row-by-row basis. Setting up some input data, we have:

Click to copy

WITH t1 AS (   SELECT 1 AS id, [0, 1] AS numbers_array UNION ALL   SELECT 2, [2, 4, 5] ) ...

(I'm using a third element for the second row to make things more interesting). If we just select from it, we get output that looks like this:

Click to copy

WITH t1 AS (   SELECT 1 AS id, [0, 1] AS numbers_array UNION ALL   SELECT 2, [2, 4, 5] ) SELECT * FROM t1; +----+---------------+ | id | numbers_array | +----+---------------+ | 1  | [0, 1]        | | 2  | [2, 4, 5]     | +----+---------------+

Now let's talk about unnesting. The UNNEST function takes an array and returns a value table of the array's element type. Whereas most BigQuery tables are SQL tables defined as a collection of columns, a value table has rows of some value type. For numbers_array, UNNEST(numbers_array) returns a value table whose value type is INT64, since numbers_array is an array with an element type of INT64. This value table contains all of the elements in numbers_array for the current row from t1.

For the row with an id of 1, the contents of the value table returned by UNNEST(numbers_array) are:

Click to copy

+-----+ | f0_ | +-----+ | 0   | | 1   | +-----+

This is the same as what we get with the following query:

Click to copy

SELECT * FROM UNNEST([0, 1]);

UNNEST([0, 1]) in this case means "create a value table from the INT64 values 0 and 1".

Similarly, for the row with an id of 2, the contents of the value table returned by UNNEST(numbers_array) are:

Click to copy

+-----+ | f0_ | +-----+ | 2   | | 4   | | 5   | +-----+

Now let's talk about how CROSS JOIN fits into the picture. In most cases, you use CROSS JOIN between two uncorrelated tables. In other words, the contents of the table on the right of the CROSS JOIN are not defined by the current contents of the table on the left.

In the case of arrays and UNNEST, however, the contents of the value table produced by UNNEST(numbers_array) change depending on the current row of t1. When we join the two tables, we get the cross product of the current row from t1 with all of the rows from UNNEST(numbers_array). For example:

Click to copy

WITH t1 AS (   SELECT 1 AS id, [0, 1] AS numbers_array UNION ALL   SELECT 2, [2, 4, 5] ) SELECT id, number FROM t1 CROSS JOIN UNNEST(numbers_array) AS number; +----+--------+ | id | number | +----+--------+ | 1  | 0      | | 1  | 1      | | 2  | 2      | | 2  | 4      | | 2  | 5      | +----+--------+

numbers_array has two elements in the first row and three elements in the second, so we get 2 + 3 = 5 rows in the result of the query.

To answer the question about how this differs from flattening the numbers_array and then performing a CROSS JOIN, let's look at the results of this query:

Click to copy

WITH t1 AS (   SELECT 1 AS id, [0, 1] AS numbers_array UNION ALL   SELECT 2, [2, 4, 5] ), t2 AS (   SELECT number   FROM t1   CROSS JOIN UNNEST(numbers_array) AS number ) SELECT number FROM t2; +--------+ | number | +--------+ | 0      | | 1      | | 2      | | 4      | | 5      | +--------+

In this case, t2 is is a SQL table with a column named number with those values. If we perform a CROSS JOIN between t1 and t2, we get a true cross product of all rows:

Click to copy

WITH t1 AS (   SELECT 1 AS id, [0, 1] AS numbers_array UNION ALL   SELECT 2, [2, 4, 5] ), t2 AS (   SELECT number   FROM t1   CROSS JOIN UNNEST(numbers_array) AS number ) SELECT id, numbers_array, number FROM t1 CROSS JOIN t2; +----+---------------+--------+ | id | numbers_array | number | +----+---------------+--------+ | 1  | [0, 1]        | 0      | | 1  | [0, 1]        | 1      | | 1  | [0, 1]        | 2      | | 1  | [0, 1]        | 4      | | 1  | [0, 1]        | 5      | | 2  | [2, 4, 5]     | 0      | | 2  | [2, 4, 5]     | 1      | | 2  | [2, 4, 5]     | 2      | | 2  | [2, 4, 5]     | 4      | | 2  | [2, 4, 5]     | 5      | +----+---------------+--------+

So what's the difference between this and the previous query with CROSS JOIN UNNEST(numbers_array)? In this case, the contents of t2 don't change for each row from t1. For the first row in t1, there are five rows in t2. For the second row in t1, there are five rows in t2. As a result, the CROSS JOIN between the two of them returns 5 + 5 = 10 rows in total.

answered Oct 09 '22 14:10

Elliott Brossard

Related questions
                            
                                Add business days to date in SQL without loops
                            
                                MYSQL - count number of rows in each table
                            
                                SQL Server row date last modified
                            
                                How to Replace Multiple Characters in SQL?
                            
                                Copy rows from the same table and update the ID column
                            
                                Oracle : select maximum value from different columns of the same row
                            
                                Need SQL Query to find Parent records without child records
                            
                                SQL conditional SELECT
                            
                                SQL Server 2008 IIF statement does not seem enabled
                            
                                How can I get the raw query string from Laravel's query builder BEFORE executing the query?
                            
                                Check if a variable is null in plsql
                            
                                CASE vs. DECODE
                            
                                SQL Server : check if variable is Empty or NULL for WHERE clause
                            
                                SQL ROWNUM how to return rows between a specific range
                            
                                How to store historical records in a history table in SQL Server
                            
                                What are the major differences between the mysql and oracle sql dialects?
                            
                                How to clear all cached items in Oracle
                            
                                Performance difference: condition placed at INNER JOIN vs WHERE clause
                            
                                How to use Elastic Search on top of a pre-existing SQL Database?
                            
                                Using DISTINCT inner join in SQL

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

SQL array flattening: Why doesn't CROSS JOIN UNNEST join every nested value with every row?

Tags:

sql

google-bigquery

conradlee

People also ask

1 Answers

Elliott Brossard

Recent Activity

Donate For Us