
Google BigQuery: there are no primary key or unique constraints, so how do you prevent duplicate records from being inserted?

Google BigQuery has no primary key or unique constraints.

We cannot use traditional SQL options such as insert ignore or insert on duplicate key update, so how do you prevent duplicate records from being inserted into Google BigQuery?

If I have to call delete first (based on the unique key in my own system) and then insert to prevent duplicate records from being inserted into BigQuery, wouldn't that be too inefficient? I would assume that insert is the cheapest operation: no query, just append data. If I have to call delete for each insert, it will be too inefficient and cost us extra money.
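
For illustration, the delete-then-insert I have in mind would look roughly like this in BigQuery standard SQL (the table, column, and parameter names are just placeholders, not a real schema):

    -- Hypothetical table and key; @my_key / @payload are query parameters for illustration.
    DELETE FROM mydataset.events WHERE my_key = @my_key;

    INSERT INTO mydataset.events (my_key, payload, updated_at)
    VALUES (@my_key, @payload, CURRENT_TIMESTAMP());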

What is your advice and suggestions based on your experience?

It would be nice if BigQuery had primary keys, but perhaps that would conflict with the algorithms/data structures that BigQuery is based on?

asked Mar 21 '17 by searain


2 Answers

So let's clear up some facts first.

BigQuery is a managed data warehouse suitable for large datasets, and it's complementary to a traditional database, rather than a replacement.

Up until early 2020 there was a maximum of only 96 DML operations (UPDATE, DELETE) on a table per day. That low limit forced you to think of BQ as a data lake. The limit has since been removed, but it shows that the early design of the system was oriented around "append-only".

So, on BigQuery, you actually let all data in, and favor an append-only design. That means that by design you have a database that holds a new row for every update. Hence if you want to use the latest data, you need to pick the last row and use that.
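
As a rough sketch, picking the latest row per logical key could look like the query below (assuming a hypothetical table with an id key and an updated_at timestamp; adjust the names to your own schema):

    -- Keep only the most recent row for each id.
    SELECT * EXCEPT(rn)
    FROM (
      SELECT
        *,
        ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC) AS rn
      FROM mydataset.mytable
    )
    WHERE rn = 1;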

We actually leverage insights from every new update we record for the same logical row. For example, we can detect how long it took an end-user to choose his/her country in the signup flow. Because we have a dropdown of countries, it takes some time to scroll to the right country, and the metrics show this: we end up with two rows in BQ, one from before the country was selected and one from after, and based on the time between them we were able to optimize the process. Our country dropdown now lists the 5 most recent/frequent countries first, so those users no longer need to scroll to pick a country; it's faster.
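
Measuring the gap between the first and last row per user is then a simple aggregation. The sketch below assumes a hypothetical signup_events table with user_id and created_at columns, not our actual schema:

    -- Time between the first and last signup row per user.
    SELECT
      user_id,
      TIMESTAMP_DIFF(MAX(created_at), MIN(created_at), SECOND) AS seconds_to_pick_country
    FROM mydataset.signup_events
    GROUP BY user_id;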

answered Sep 22 '22 by Pentium10


"Bulk Delete and Insert" is the approach I am using to avoid the duplicated records. And Google's own "Youtube BigQuery Transfer Services" is using "Bulk Delete and Insert" too.

"Youtube BigQuery Transfer Services" push daily reports to the same set of report tables every day. Each record has a column "date".

When we run Youtube Bigquery Transfer backfill (ask youtube bigquery transfer to push the reports for certain dates again.) Youtube BigQury Transfer services will first, delete the full dataset for that date in the report tables and then insert the full dataset of that date back to the report tables again.
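
As a sketch, a bulk delete and insert for one report date could look like this in standard SQL (the table names are made up, and the transfer service does the equivalent internally rather than via these exact statements):

    -- Replace one day of data in a hypothetical report table.
    DELETE FROM mydataset.daily_report WHERE date = DATE '2017-03-20';

    INSERT INTO mydataset.daily_report
    SELECT * FROM mydataset.daily_report_staging WHERE date = DATE '2017-03-20';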

Another approach is to drop the results table first (if it already exists), then re-create it and re-insert the results. I use this approach a lot. Every day, I save my processed results in result tables in a daily dataset. If I rerun the process for a given day, my script checks whether the result tables for that day already exist. If a table exists, it is deleted, a fresh table is re-created, and the results are re-inserted into the newly created table.
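
One way to express that check/delete/re-create in a single statement is CREATE OR REPLACE TABLE; the sketch below uses made-up table and column names for illustration:

    -- Drops the existing table (if any) and recreates it with the day's results.
    CREATE OR REPLACE TABLE mydataset.daily_results_20170321 AS
    SELECT my_key, COUNT(*) AS row_count, SUM(amount) AS total_amount
    FROM mydataset.raw_events
    WHERE DATE(event_timestamp) = DATE '2017-03-21'
    GROUP BY my_key;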

answered Sep 21 '22 by searain