Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Avoid duplicates in bigquery

i'm working with bigquery and in the documentation it's said:

Unlike a traditional RDBMS, there is no notion of primary/secondary or row-id keys. If required, identify a column in the table schema for that purpose.

Do you know how could i insert without duplicates as a pk would do(but not only in the same insert)? Regards

like image 773
Daniel Ruiz Avatar asked Apr 24 '17 10:04

Daniel Ruiz


1 Answers

So let's clear some fact right in the first place. As you cannot insert without duplicates.

Bigquery is a managed data warehouse suitable for large datasets, and it's complimentary not a replacement for traditional databases.

You can only do a maximum of 96 DML (update,delete) operations on a table per day. This is by design. It's low because it forces you to think like as BQ as a data lake.

So on Bigquery you actually let every data in, everything is append-only by design. That means that by design you have a database that holds a new row for every update. Hence if you want to use the latest data, you need to pick the last row and use that.

We actually leverage insights from every new update we add to the same row. Like for example we can detect how long did take for the end-user to choose it's country at signup flow. Because we have a dropdown of countries it took some time until it scrolled to his country, and metrics shown this, because we ended up in BQ with two rows, one prior country selected, and one after country selected and based on time selection we were able to optimize the process. Now on our country drop-down we have first 5 most recent/frequent countries listed, so those users no longer need to scroll and pick a country, it's faster.

In other words you use Streaming Insert functionality to constantly add new rows. Then you have your SQL queries usually with Window Functions to pick last row.

You cannot update a row, or append to a record as BigQuery limits DML statements to 96 per table.

like image 156
Pentium10 Avatar answered Oct 22 '22 12:10

Pentium10