I am planning to append incremental data on a daily basis to a BigQuery table. Each time I add incremental data to the existing table, I want to eliminate duplicate records (based on a primary key column) from the existing data in the table. One approach would be to -
SELECT all_cols FROM table WHERE pkey_col NOT IN (INCR_KEYS)
- and store the results in a new table. My concern with this approach is that it creates a duplicate copy of a big table and adds to my bill.
Is there a better way of achieving the same without creating a duplicate table?
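For concreteness, the flow I have in mind would look something like this with the bq tool (all table and column names below are placeholders):

# 1. Write the de-duplicated rows of the big table into a new table:
bq query --allow_large_results --destination_table=mydataset.big_deduped \
  'SELECT * FROM mydataset.big
   WHERE pkey_col NOT IN (SELECT pkey_col FROM mydataset.incr)'
# 2. Append the incremental rows:
bq cp --append_table mydataset.incr mydataset.big_deduped
# 3. Overwrite the original table with the result and drop the extra copy:
bq cp -f mydataset.big_deduped mydataset.big
bq rm -f mydataset.big_deduped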
I don't know of a way to do this without creating a duplicate table -- this actually sounds like a pretty clever solution.
The incremental cost to you, however, is likely to be very small -- BigQuery bills for storage only for the length of time the data exists. If you delete the old table right away, you'd only pay for both tables for a period of seconds or minutes.
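To put a rough number on it: assuming a storage price of $0.02 per GB per month (an assumption -- check current pricing), keeping an extra 1 TB copy alive for ten minutes costs roughly 1024 GB × $0.02 × 10 / (30 × 24 × 60) ≈ $0.005, i.e. about half a cent.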
You could run the query with the destination table set to the existing table and set the write disposition to truncate:
bq query --allow_large_results --replace --destination_table=mydataset.mytable \
'SELECT * FROM mydataset.mytable
WHERE key NOT IN (SELECT key FROM mydataset.update)'
bq cp --append_table mydataset.update mydataset.mytable
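For clarity: --replace sets the query's write disposition to WRITE_TRUNCATE, so the first command rewrites mydataset.mytable in place, keeping only rows whose key does not appear in mydataset.update; the second command then appends the fresh rows. No extra table is created.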
I believe this will work, but I think it is worth taking a backup first, especially as you can delete it soon after:
bq cp mydataset.mytable mydataset.backup
# You can also build the new table in one pass:
bq query --allow_large_results --replace --destination_table=mydataset.mytable \
'SELECT * FROM (
SELECT * FROM mydataset.mytable
WHERE key NOT IN (SELECT key FROM mydataset.update)
), (
SELECT * FROM mydataset.update
)'
bq rm mydataset.backup
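(In legacy BigQuery SQL, the comma between the two subqueries means UNION ALL, so that one-pass query writes the filtered old rows plus all of the update rows back over mydataset.mytable in a single job.)

As an aside, on current BigQuery the same de-duplicating upsert can be written as a single standard-SQL MERGE statement. A minimal sketch, assuming both tables share a schema whose non-key columns are col1 and col2 (placeholder names):

MERGE mydataset.mytable t
USING mydataset.update u
ON t.key = u.key
WHEN MATCHED THEN
  UPDATE SET col1 = u.col1, col2 = u.col2   -- list every non-key column here
WHEN NOT MATCHED THEN
  INSERT ROW

INSERT ROW copies the entire source row when the schemas match, so new keys need no column list.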