
WRITE_TRUNCATE behaviour in BigQuery

I have a question about the WRITE_TRUNCATE behaviour in BigQuery.

I have a BigQuery table (T1) that I periodically append log data to (one row per log line). I want a Dataflow job (D1) that reads from this table, removes duplicate rows, performs other data-cleansing operations, and writes the result to another BigQuery table (T2), replacing any data already present in that table. I believe I can do this with the WRITE_TRUNCATE write disposition on the BigQueryIO sink in the Dataflow job.

The question is: if another Dataflow job (D2) reads from T2 while D1 is in the middle of a write-truncate to that table, what data does D2 see? Does it see the table either as it was before the truncate or as it is after the truncate has finished? Or can it see the table at an intermediate step of the truncate (e.g. part way through appending the new data)?

The javadoc linked above suggests that the truncate may not be atomic, while the REST documentation for BigQuery suggests that it is.

asked Sep 13 '17 00:09 by hbakkum


1 Answer

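The REST API documentation is the source of truth here: the change is atomic and takes effect upon the BigQuery job's successful completion. A concurrent reader such as D2 therefore sees T2 either as it was before the truncate or as it is after the job has finished, not a partially written state.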
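For reference, here is a minimal sketch of where the write disposition sits in a REST `jobs.insert` request body for a query job. The field names (`configuration.query.writeDisposition`, `destinationTable`, `createDisposition`, `useLegacySql`) come from the REST API; the project and dataset names, the helper function, and the `SELECT DISTINCT` deduplication query are assumptions made for illustration and are not the asker's actual Dataflow job.

```python
import json


def build_truncate_query_job(project_id, dataset_id, source_table, dest_table):
    """Build a jobs.insert request body whose result overwrites dest_table.

    Hypothetical helper: it only assembles the JSON body and sends nothing.
    """
    # Assumed dedup step: SELECT DISTINCT drops exact duplicate rows.
    sql = f"SELECT DISTINCT * FROM `{project_id}.{dataset_id}.{source_table}`"
    return {
        "configuration": {
            "query": {
                "query": sql,
                "useLegacySql": False,
                "destinationTable": {
                    "projectId": project_id,
                    "datasetId": dataset_id,
                    "tableId": dest_table,
                },
                # Create T2 on the first run, replace its contents afterwards.
                "createDisposition": "CREATE_IF_NEEDED",
                "writeDisposition": "WRITE_TRUNCATE",
            }
        }
    }


# "my-project" and "logs" are placeholder names, not from the question.
body = build_truncate_query_job("my-project", "logs", "T1", "T2")
print(json.dumps(body, indent=2))
```

The write disposition belongs to the job as a whole, which matches the answer's point that the replacement of T2 takes effect when the job completes successfully.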

answered Oct 13 '22 00:10 by Michael Moursalimov