I am planning to append incremental data on a daily basis to a BigQuery table. Each time I add incremental data to the existing table, I want to eliminate duplicate records (based on a primary key column) from the existing data in the table. One approach would be to -
SELECT all_cols FROM table WHERE pkey_col NOT IN (INCR_KEYS)
- and store the results in a new table. My concern with this approach is that it creates a duplicate copy of a big table and adds to my bill.
Is there a better way of achieving the same without creating a duplicate table?
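For concreteness, the flow I have in mind would look something like this with the bq tool (all table and column names below are placeholders):

# 1. Write the de-duplicated rows of the big table into a new table:
bq query --allow_large_results --destination_table=mydataset.big_deduped \
  'SELECT * FROM mydataset.big
   WHERE pkey_col NOT IN (SELECT pkey_col FROM mydataset.incr)'
# 2. Append the incremental rows:
bq cp --append_table mydataset.incr mydataset.big_deduped
# 3. Overwrite the original table with the result and drop the extra copy:
bq cp -f mydataset.big_deduped mydataset.big
bq rm -f mydataset.big_deduped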
I don't know of a way to do this without creating a duplicate table -- this actually sounds like a pretty clever solution.
The incremental cost to you, however, is likely to be very small -- BigQuery bills for storage only for the length of time the data exists. If you delete the old table right away, you'd only pay for both tables for a period of seconds or minutes.
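To put a rough number on it: assuming a storage price of $0.02 per GB per month (an assumption -- check current pricing), keeping an extra 1 TB copy alive for ten minutes costs roughly 1024 GB × $0.02 × 10 / (30 × 24 × 60) ≈ $0.005, i.e. about half a cent.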
You could run the query with the destination table set to the existing table and set the write disposition to truncate:
bq query --allow_large_results --replace --destination_table=mydataset.mytable \
'SELECT * FROM mydataset.mytable
WHERE key NOT IN (SELECT key FROM mydataset.update)'
bq cp --append_table mydataset.update mydataset.mytable
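For clarity: --replace sets the query's write disposition to WRITE_TRUNCATE, so the first command rewrites mydataset.mytable in place, keeping only rows whose key does not appear in mydataset.update; the second command then appends the fresh rows. No extra table is created.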
I believe this will work, but I think it is worth taking a backup first, especially as you can delete it soon after:
bq cp mydataset.mytable mydataset.backup
# You can also build the new table in one pass:
bq query --allow_large_results --replace --destination_table=mydataset.mytable \
'SELECT * FROM (
SELECT * FROM mydataset.mytable
WHERE key NOT IN (SELECT key FROM mydataset.update)
), (
SELECT * FROM mydataset.update
)'
bq rm mydataset.backup
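(In legacy BigQuery SQL, the comma between the two subqueries means UNION ALL, so that one-pass query writes the filtered old rows plus all of the update rows back over mydataset.mytable in a single job.)

As an aside, on current BigQuery the same de-duplicating upsert can be written as a single standard-SQL MERGE statement. A minimal sketch, assuming both tables share a schema whose non-key columns are col1 and col2 (placeholder names):

MERGE mydataset.mytable t
USING mydataset.update u
ON t.key = u.key
WHEN MATCHED THEN
  UPDATE SET col1 = u.col1, col2 = u.col2   -- list every non-key column here
WHEN NOT MATCHED THEN
  INSERT ROW

INSERT ROW copies the entire source row when the schemas match, so new keys need no column list.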