
BigQuery deduplication query: example explanation

Can anybody explain this BigQuery query for deduplication? Why do we need to use [OFFSET(0)]? I think it takes the first element of the aggregated array, right? Isn't that the same as LIMIT 1? Why do we need to aggregate the entire table? And why can we aggregate an entire table into a single cell?

    # take the one name associated with a SKU
    WITH product_query AS (
      SELECT DISTINCT
        v2ProductName,
        productSKU
      FROM `data-to-insights.ecommerce.all_sessions_raw`
      WHERE v2ProductName IS NOT NULL
    )
    SELECT k.* FROM (
      # aggregate the products into an array and
      # only take 1 result
      SELECT ARRAY_AGG(x LIMIT 1)[OFFSET(0)] k
      FROM product_query x
      GROUP BY productSKU # this is the field we want deduplicated
    );
NewPy asked Jun 12 '26 07:06



1 Answer

Let's start with some data we want to de-duplicate:

WITH table AS (
  SELECT * FROM UNNEST([STRUCT('001' AS id, 1 AS a, 2 AS b), ('002', 3, 5), ('001', 1, 4)])
)

SELECT *
FROM table t


Now, instead of *, I'm going to use t to refer to the whole row:

SELECT t
FROM table t


What happens if I group these rows by their id?

SELECT t.id, ARRAY_AGG(t) tt
FROM table t
GROUP BY 1


Now I have all the rows with the same id grouped together. But let me choose only one:

SELECT t.id, ARRAY_AGG(t LIMIT 1) tt
FROM table t
GROUP BY 1


That might look good, but it's still one row inside an array. How can I get just the row, and not an array?

SELECT t.id, ARRAY_AGG(t LIMIT 1)[OFFSET(0)] tt
FROM table t
GROUP BY 1
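
This is also why LIMIT 1 and [OFFSET(0)] are not the same thing: LIMIT 1 caps how many rows go into each group's array, while [OFFSET(0)] pulls the first element out of that array. A small sketch that makes the difference visible by counting array elements (the CTE name sample_rows and the output column names are only illustrative, not part of the original query):

```sql
# Same sample data as above, under a hypothetical CTE name.
WITH sample_rows AS (
  SELECT * FROM UNNEST([
    STRUCT('001' AS id, 1 AS a, 2 AS b), ('002', 3, 5), ('001', 1, 4)
  ])
)
SELECT
  t.id,
  # Without LIMIT, every row sharing the id lands in the array.
  ARRAY_LENGTH(ARRAY_AGG(t)) AS rows_in_group,
  # With LIMIT 1, the array holds at most one row per id.
  ARRAY_LENGTH(ARRAY_AGG(t LIMIT 1)) AS rows_kept
FROM sample_rows t
GROUP BY t.id
```

For id '001' this should report 2 rows in the group and 1 row kept; for '002' both counts are 1. Note that the array is built per group, not for the whole table, so each output cell only holds the rows for one id.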


And if I want to get the row back without the grouping id or the tt prefix:

SELECT tt.*
FROM (
  SELECT t.id, ARRAY_AGG(t LIMIT 1)[OFFSET(0)] tt
  FROM table t
  GROUP BY 1
)


And that's how you de-duplicate rows based on their ids.

If you need to choose a particular row, for example the newest one given a timestamp, just order the aggregation, as in ARRAY_AGG(t ORDER BY timestamp DESC LIMIT 1).
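
A minimal sketch of that ordered variant as a complete query. The sample data, the CTE name sample_rows, and the updated_at column are assumptions made up for illustration; substitute your own table and timestamp column:

```sql
# Hypothetical sample data: id '001' appears twice with different timestamps.
WITH sample_rows AS (
  SELECT * FROM UNNEST([
    STRUCT('001' AS id, 1 AS a, TIMESTAMP '2020-01-01 00:00:00' AS updated_at),
    ('002', 3, TIMESTAMP '2020-01-02 00:00:00'),
    ('001', 4, TIMESTAMP '2020-01-03 00:00:00')
  ])
)
SELECT tt.*
FROM (
  # Sort each group newest-first, keep one row, then unwrap it from the array.
  SELECT ARRAY_AGG(t ORDER BY t.updated_at DESC LIMIT 1)[OFFSET(0)] tt
  FROM sample_rows t
  GROUP BY t.id
)
```

With this data, the row kept for id '001' should be the one with a = 4, since it has the later updated_at. Without the ORDER BY, which of the duplicate rows survives is not guaranteed.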

Felipe Hoffa answered Jun 16 '26 14:06
