I ran into some freak undocumented behavior of time-partitioned bigquery tables:
I created a time-partitioned table in BigQuery and inserted data. I was able to insert normally - data was written to today's partition (I was also able to explicitly specify a partition and write into it)
After some tests with new data, I deleted today's partition, in order to have clean data:(CLI)
bq --project_id=my-project rm v1.mytable$20160613
I then checked whether it's empty:
select count(*) from [v1.mytable]
Result 270 instead of 0
I tried deleting again and rerunning the query - same result. So I queried
select count(*) from [v1.mytable$20160613]
Result 0
so a couple of previous dates in which I may have inserted data, but all were 0. Finally I ran
SELECT partition_id from [v1.mytable$__PARTITIONS_SUMMARY__];
and the result was
{ UNPARTITIONED 20160609 20160613 }
and all the data was in fact in UNPARTITIONED
My questions:
A partitioned table is a special table that is divided into segments, called partitions, that make it easier to manage and query your data. By dividing a large table into smaller partitions, you can improve query performance, and you can control costs by reducing the number of bytes read by a query.
The maximum number of partitions — Using partitioning on BigQuery tables, there is a hard limit of 4,000 partitions per partitioned table.
In summary, partition itself may not get you better performance. It is quite possible when you partition your queries even start getting slower because now there is one more function to be processed between your query and data.
While data is in the streaming buffer, it remains in the UNPARTITIONED partition. To address this partition in a query, you can use the value NULL for the _PARTITIONTIME pseudo column.
SELECT ... FROM mydataset.mypartitioned_table WHERE _PARTITIONTIME IS NULL
To delete data for a given partition, we suggest doing a write truncate to it with a query that returns an empty result. For example:
bq query --destination_table=mydataset.mypartitionedtable\$20160121 --replace 'SELECT 1 as field1, "one" as field2 FROM (SELECT 1 as field1, "one" as field2) WHERE FALSE'
Note that the partition will still be around (if you do a SELECT * from table$__PARTITIONS__SUMMARY), but it will have 0 rows.
$ bq query 'SELECT COUNT(*) from [mydataset.mypartitionedtable$20160121]'
+-----+
| f0_ |
+-----+
| 0 |
+-----+
This is a temporary state -- querying an hour later the records all belonged to today's partition.
The effect is thus similar to a delay in data write: querying immediately after the insert may not have the most recent data in the correct partition, but eventually this will be ok
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With