Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

In time-partitioned bigquery tables, when is data written to __UNPARTITIONED__? what are the effects?

I ran into some freak undocumented behavior of time-partitioned bigquery tables:

I created a time-partitioned table in BigQuery and inserted data. I was able to insert normally - data was written to today's partition (I was also able to explicitly specify a partition and write into it)

After some tests with new data, I deleted today's partition, in order to have clean data:(CLI)

bq --project_id=my-project rm v1.mytable$20160613

I then checked whether it's empty:

select count(*) from [v1.mytable]

Result 270 instead of 0

I tried deleting again and rerunning the query - same result. So I queried

select count(*) from [v1.mytable$20160613]

Result 0

so a couple of previous dates in which I may have inserted data, but all were 0. Finally I ran

SELECT partition_id from [v1.mytable$__PARTITIONS_SUMMARY__];

and the result was

{ UNPARTITIONED 20160609 20160613 }

and all the data was in fact in UNPARTITIONED

My questions:

  1. When is the data written to this special partition instead of the daily partition, and how can I avoid this?
  2. Are there other effects, except from losing the ability to address specific dates (in query, or when deleting data, etc.)? should I take care for this case?
like image 331
Ran Avnimelech Avatar asked Jun 13 '16 14:06

Ran Avnimelech


People also ask

What are partitioned tables in BigQuery?

A partitioned table is a special table that is divided into segments, called partitions, that make it easier to manage and query your data. By dividing a large table into smaller partitions, you can improve query performance, and you can control costs by reducing the number of bytes read by a query.

How many maximum number of partitions are possible in a BigQuery partitioned table?

The maximum number of partitions — Using partitioning on BigQuery tables, there is a hard limit of 4,000 partitions per partitioned table.

Does partitioning improve query performance?

In summary, partition itself may not get you better performance. It is quite possible when you partition your queries even start getting slower because now there is one more function to be processed between your query and data.


2 Answers

  1. While data is in the streaming buffer, it remains in the UNPARTITIONED partition. To address this partition in a query, you can use the value NULL for the _PARTITIONTIME pseudo column.

    SELECT ... FROM mydataset.mypartitioned_table WHERE _PARTITIONTIME IS NULL

  2. To delete data for a given partition, we suggest doing a write truncate to it with a query that returns an empty result. For example:

    bq query --destination_table=mydataset.mypartitionedtable\$20160121 --replace 'SELECT 1 as field1, "one" as field2 FROM (SELECT 1 as field1, "one" as field2) WHERE FALSE'

Note that the partition will still be around (if you do a SELECT * from table$__PARTITIONS__SUMMARY), but it will have 0 rows.

$ bq query 'SELECT COUNT(*) from [mydataset.mypartitionedtable$20160121]'

+-----+
| f0_ |
+-----+
|   0 |
+-----+
like image 200
Pavan Edara Avatar answered Jan 04 '23 16:01

Pavan Edara


This is a temporary state -- querying an hour later the records all belonged to today's partition.

The effect is thus similar to a delay in data write: querying immediately after the insert may not have the most recent data in the correct partition, but eventually this will be ok

like image 27
Ran Avnimelech Avatar answered Jan 04 '23 17:01

Ran Avnimelech