I am using below query to delete duplicates records from bigquery using standard sql. but it is throwing error
with cte as (
select * ,row_number()over (partition by CallRailCallId order by CallRailCallId) as rn
from `encoremarketingtest.EncoreMarketingTest.CallRailCall2` )
delete
from cte
where rn>1
Query Failed Error: Syntax error: Expected "(" or keyword SELECT but got keyword DELETE at [5:5]
Could anyone help me on the correct approach in BigQuery?
If a table has duplicate rows, we can delete it by using the DELETE statement. In the case, we have a column, which is not the part of group used to evaluate the duplicate records in the table.
1) First identify the rows those satisfy the definition of duplicate and insert them into temp table, say #tableAll . 2) Select non-duplicate(single-rows) or distinct rows into temp table say #tableUnique. 3) Delete from source table joining #tableAll to delete the duplicates.
The go to solution for removing duplicate rows from your result sets is to include the distinct keyword in your select statement. It tells the query engine to remove duplicates to produce a result set in which every row is unique.
Option #1
CREATE OR REPLACE TABLE `project.dataset.your_table` AS
SELECT * EXCEPT(rn)
FROM (
SELECT *, ROW_NUMBER() OVER(PARTITION BY CallRailCallId ORDER BY CallRailCallId) rn
FROM `project.dataset.your_table`
)
WHERE rn = 1
Option #2
CREATE OR REPLACE TABLE `project.dataset.your_table` AS
SELECT row.*
FROM (
SELECT ARRAY_AGG(t ORDER BY CallRailCallId LIMIT 1)[OFFSET(0)] row
FROM `project.dataset.your_table` t
GROUP BY CallRailCallId
)
As you might noticed, above options using DDL
(CREATE TABLE) approach and that is where it is possible to use just one known (from your question) column - CallRailCallId
Also, note - ORDER BY CallRailCallId
plays no real role there because GROUP BY and PARTITION BY are by exactly same filed. But if you change the field this will control which exactly row (out of few duplicates) to "survive" (For example ORDER BY ts DESC
- see below option for what ts might be)
Option #3
This option uses DML
(DELETE FROM) but requires some extra column to be used to serve as a tie-breaker
For example you have ts
TIMESTAMP field and you want the most recent (based on ts) row to survive
DELETE FROM `project.dataset.your_table`
WHERE STRUCT(CallRailCallId, ts) NOT IN (
SELECT AS STRUCT CallRailCallId, MAX(ts) ts
FROM `project.dataset.your_table`
GROUP BY CallRailCallId
)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With