BigQuery - DELETE statement to remove duplicates

Tags:

google-bigquery

There are plenty of great posts on SQL that selects unique rows and write (truncates) a table so the dus are removed. e.g

WITH ev AS (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY id ORDER BY loadTime DESC) AS rowNum
  FROM `duplicates`
)
SELECT
  * EXCEPT(rowNum)
FROM
  ev
WHERE rowNum = 1

I was trying to explore this slightly differently using DML and DELETE (e.g if you don't want to use a BQ savedQuery, just execute SQL). What I want to do is roughly:

WITH dup_events AS (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY id ORDER BY loadTime DESC) AS rowNum
  FROM `duplicates`
)
DELETE FROM
  dup_events
WHERE rowNum > 1

but got this error in the console:

Syntax error: Expected "(" or keyword SELECT but got keyword DELETE at [10:1]

Can this be achieved (standardSQL) using DELETE?

216

asked Jan 06 '18 18:01

Kurt Maile

3 Answers

Below actually :o) works

#standardSQL
DELETE FROM `yourproject.yourdataset.duplicates`
WHERE STRUCT(id, loadTime) NOT IN (
        SELECT AS STRUCT id, MAX(loadTime) loadTime 
        FROM `yourproject.yourdataset.duplicates` 
        GROUP BY id)

Note: it assumes that loadTime is also unique - meaning if for given id there are more than one record with latest loadTime - they all will be preserved

answered Oct 19 '22 01:10

Mikhail Berlyant

Those answers above works only for small size table. If you have a large size partition table, and only want to remove duplicates in a given range, use the SQL below:

-- WARNING: back up the table before this operation
-- FOR large size timestamp partitioned table 
-- -------------------------------------------
-- -- To de-duplicate rows of a given range of a partition table, using surrage_key as unique id
-- -------------------------------------------

DECLARE dt_start DEFAULT TIMESTAMP("2019-09-17T00:00:00", "America/Los_Angeles") ;
DECLARE dt_end DEFAULT TIMESTAMP("2019-09-22T00:00:00", "America/Los_Angeles");

MERGE INTO `gcp_project`.`data_set`.`the_table` AS INTERNAL_DEST
USING (
  SELECT k.*
  FROM (
    SELECT ARRAY_AGG(original_data LIMIT 1)[OFFSET(0)] k 
    FROM `gcp_project`.`data_set`.`the_table` AS original_data
    WHERE stamp BETWEEN dt_start AND dt_end
    GROUP BY surrogate_key
  )

) AS INTERNAL_SOURCE
ON FALSE

WHEN NOT MATCHED BY SOURCE
  AND INTERNAL_DEST.stamp BETWEEN dt_start AND dt_end -- remove all data in partiion range
    THEN DELETE

WHEN NOT MATCHED THEN INSERT ROW

credit: https://gist.github.com/hui-zheng/f7e972bcbe9cde0c6cb6318f7270b67a

answered Oct 19 '22 00:10

Hui Zheng

This has to be the easiest way:

create or replace table `myproject.mydataset.duplicates` as (
select distinct *
from `myproject.mydataset.duplicates`)

If you have an array data type, try this:

-- build a test table with a duplicate and an array datatype column --
create or replace table DW.pmoTest as (
select 1 as ID, 'peter' as firstname,ARRAY<INT64>[1, 2, 3]  as int_array, current_date as createdate
union all
select 1 as ID, 'peter' as firstname,ARRAY<INT64>[1, 7, 3] as int_array, current_date as createdate
union all
select 2 as ID, 'chamri' as firstname,ARRAY<INT64>[1, 2, 39, 4] as int_array, current_date as createdate
);

-- recreate table without duplicate row
create or replace table DW.pmoTest as (
SELECT col.* FROM (
  SELECT ARRAY_AGG(tbl ORDER BY createdate LIMIT 1)[OFFSET(0)]  col
  FROM DW.pmoTest tbl
  GROUP BY ID
  )
);

answered Oct 19 '22 00:10

PetArbiter

Related questions
                            
                                Getting module 'google.protobuf.descriptor_pool' has no attribute 'Default' in my python script
                            
                                Bigquery console does not show all tables
                            
                                How to run a BigQuery query in Python
                            
                                What to try to get BigQuery to CAST BYTES to STRING?
                            
                                Serviceaccount does not have bigquery.jobs.create permission
                            
                                Re-use BigQuery window function partition
                            
                                REGEXP_MATCH in BigQuery Standard SQL
                            
                                How to query BigQuery programmatically from Python without end-user interaction?
                            
                                BigQuery Where Date is Less Than or Equal to 3 Days Minus Current Date
                            
                                Is there a way to do rolling averages in Big Query?
                            
                                BigQuery equivalent of COALESCE()?
                            
                                Is there a easy way to get the data with timestamp == yesterday?
                            
                                Is user_pseudo_id the same as a a session id? How to group all events by session? - Firebase BigQuery
                            
                                Querying multiple repeated fields in BigQuery
                            
                                How to improve performance of GeoIP query in BigQuery?
                            
                                How to load multiple files (same schema) into a table in BigQuery?
                            
                                Computing Percentiles In BigQuery
                            
                                Getting Error: redirect_uri_mismatch The redirect URI in the request: http://localhost:8080/oauth2callback did not match a registered redirect URI
                            
                                Google BigQuery Underlying Architecture
                            
                                Bigquery - Insert new data row into table by python

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With