Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Eliminating duplicate records in a BigQuery Table

I am planning to append incremental data on a daily basis to a BigQuery table. Each time I add incremental data to the existing table, I want to eliminate duplicate records (based on a primary key column) from the existing data in the table. One approach would be to -

  1. Collect the set of keys from the incremental data (lets call it INCR_KEYS)
  2. Run a query on the lines of - SELECT all_cols from table where pkey_col NOT IN (INCR_KEYS) - and store the results in a new table.
  3. Append the incremental data to the new table.

My concern with this approach is that it creates a duplicate copy of a big table and adds to my bills.

Is there a better way of achieving the same without creating a duplicate table?

like image 478
user1659408 Avatar asked Sep 10 '12 07:09

user1659408


2 Answers

I din't know of a way to do this without creating a duplicate table -- this actually sounds like a pretty clever solution.

The incremental cost to you, however, is likely to be very small -- BigQuery only bills you for data for the length of time that it exists. If you delete the old table, you'd only need to pay for both tables for a period of seconds or minutes.

like image 92
Jordan Tigani Avatar answered Sep 18 '22 09:09

Jordan Tigani


You could run the query with the destination table set to the existing table and set the write disposition to truncate:

bq query --allow_large_results --replace --destination_table=mydataset.mytable \
    'SELECT * FROM mydataset.mytable
     WHERE key NOT IN (SELECT key FROM mydataset.update)'
bq cp --append_table mydataset.update mydataset.mytable

I believe this will work but I think it is worth taking a backup, especially as you can delete it soon after.

bq cp mydataset.mytable mydataset.backup
# You can also build the new table in one pass:
bq query --allow_large_results --replace --destination_table=mydataset.mytable \
    'SELECT * FROM (
         SELECT * FROM mydataset.mytable
         WHERE key NOT IN (SELECT key FROM mydataset.update)
     ), (
         SELECT * FROM mydataset.update
     )'
bq rm mydataset.backup
like image 29
Siddartha Naidu Avatar answered Sep 18 '22 09:09

Siddartha Naidu