BigQuery deduplication query example explanation

Question

Can anybody explain this BigQuery query for deduplication? Why do we need to use [OFFSET(0)]? I think it takes the first element of the aggregated array, right? Isn't that the same as LIMIT 1? Why do we need to aggregate the entire table? And how can an entire table be aggregated into a single cell?

    # take the one name associated with a SKU
    WITH product_query AS (
      SELECT DISTINCT
        v2ProductName,
        productSKU
      FROM `data-to-insights.ecommerce.all_sessions_raw`
      WHERE v2ProductName IS NOT NULL
    )
    SELECT k.* FROM (
      # aggregate the products into an array and
      # only take 1 result
      SELECT ARRAY_AGG(x LIMIT 1)[OFFSET(0)] k
      FROM product_query x
      GROUP BY productSKU # this is the field we want deduplicated
    );

Felipe Hoffa · Accepted Answer

Let's start with some data we want to de-duplicate:

WITH table AS (SELECT * FROM UNNEST([STRUCT('001' AS id, 1 AS a, 2 AS b), ('002', 3,5), ('001', 1, 4)]))

SELECT *
FROM table t

Now, instead of *, I'm going to use t to refer to the whole row:

SELECT t
FROM table t

What happens if I group these rows by their id?

SELECT t.id, ARRAY_AGG(t) tt
FROM table t
GROUP BY 1
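
To check how many rows landed in each group, one option is to wrap the aggregation in ARRAY_LENGTH. This is a minimal sketch that reuses the sample data from the first query; the n_rows alias is only illustrative and does not come from the original answer.

```sql
WITH table AS (
  SELECT * FROM UNNEST([
    STRUCT('001' AS id, 1 AS a, 2 AS b), ('002', 3, 5), ('001', 1, 4)
  ])
)
# n_rows (illustrative alias) counts how many rows share each id
SELECT t.id, ARRAY_LENGTH(ARRAY_AGG(t)) AS n_rows
FROM table t
GROUP BY 1
ORDER BY 1
```

With this data, id 001 has two rows in its array and id 002 has one, so 001 is the id that needs de-duplicating.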

Now I have all the rows with the same id grouped together. But let me choose only one:

SELECT t.id, ARRAY_AGG(t LIMIT 1) tt
FROM table t
GROUP BY 1

That might look good, but that's still one row inside one array. How can I get only the row, and not an array?

SELECT t.id, ARRAY_AGG(t LIMIT 1)[OFFSET(0)] tt
FROM table t
GROUP BY 1
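
On the question of whether [OFFSET(0)] is the same as LIMIT 1: the two do different jobs. LIMIT 1 inside ARRAY_AGG caps each group's array at one element, while [OFFSET(0)] pulls that element (zero-based) out of the array. The sketch below puts both forms side by side on the same sample data; the as_array and as_struct aliases are illustrative names, not part of the original query.

```sql
WITH table AS (
  SELECT * FROM UNNEST([
    STRUCT('001' AS id, 1 AS a, 2 AS b), ('002', 3, 5), ('001', 1, 4)
  ])
)
SELECT
  t.id,
  # an ARRAY of STRUCTs holding at most one element per id
  ARRAY_AGG(t LIMIT 1) AS as_array,
  # the STRUCT itself, extracted from position 0 of that array
  ARRAY_AGG(t LIMIT 1)[OFFSET(0)] AS as_struct
FROM table t
GROUP BY 1
```

Note that this LIMIT applies per group, inside the aggregation. A LIMIT 1 at the end of the whole query would instead return a single row for the entire result. Without an ORDER BY inside ARRAY_AGG, which of the duplicate rows is kept is not guaranteed.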

And if I want to get back a row without the grouping id or the tt prefix:

SELECT tt.*
FROM (
  SELECT t.id, ARRAY_AGG(t LIMIT 1)[OFFSET(0)] tt
  FROM table t
  GROUP BY 1
)
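
The snippets above leave out the WITH clause from the first query. Stitched together, a version that can be pasted and run on its own might look like this (same sample data, nothing else assumed):

```sql
WITH table AS (
  SELECT * FROM UNNEST([
    STRUCT('001' AS id, 1 AS a, 2 AS b), ('002', 3, 5), ('001', 1, 4)
  ])
)
SELECT tt.*
FROM (
  # one array per id, capped at one row, then unwrapped into a plain row
  SELECT t.id, ARRAY_AGG(t LIMIT 1)[OFFSET(0)] tt
  FROM table t
  GROUP BY 1
)
```

The result has one row per id with the original columns id, a and b. For id 001 either of its two rows may be returned, since no ordering is specified.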

And that's how you de-duplicate rows based on their ids.

If you need to choose a particular row, for example the newest one given a timestamp, just order the aggregation, as in ARRAY_AGG(t ORDER BY timestamp DESC LIMIT 1).
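
As a sketch of that ordered variant, the query below adds a timestamp column to the sample data. The column name ts and its values are assumptions made up for illustration; they are not part of the original data.

```sql
WITH table AS (
  SELECT * FROM UNNEST([
    # ts is a hypothetical timestamp column added for this example
    STRUCT('001' AS id, 1 AS a, 2 AS b, TIMESTAMP '2020-01-01 00:00:00' AS ts),
    ('002', 3, 5, TIMESTAMP '2020-01-02 00:00:00'),
    ('001', 1, 4, TIMESTAMP '2020-01-03 00:00:00')
  ])
)
SELECT tt.*
FROM (
  # newest row first within each id, then keep only that row
  SELECT ARRAY_AGG(t ORDER BY t.ts DESC LIMIT 1)[OFFSET(0)] tt
  FROM table t
  GROUP BY t.id
)
```

Because the rows are ordered before the LIMIT is applied, the choice is now deterministic: for id 001 the row with b = 4 is kept, since it has the later ts.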

Tags:

google-bigquery

NewPy
