Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How does BigQuery caching on time partitioned tables work?

In contrast with the BigQuery documentation, we see that it DOES cache the results when selecting data from a streaming, data partitioned table (Standard SQL).

Example: When we perform a deterministic date scan on the streaming, data partitioned table using:

where (_PARTITIONTIME > '2017-11-12' or _PARTITIONTIME is null)

...BigQuery caches the data for 5 to 20 minutes if we fire the same exact query within that time frame.

While in my interpretation of the documentation it states that it SHOULD NOT cache the data:

'When any of the tables referenced by the query have recently received streaming inserts (a streaming buffer is attached to the table) even if no new rows have arrived'

Important notes:

  • Our test query queries heartbeat events that really arrive at us continuously
  • We actually want this caching behavior, because we do not always need to have data to be actual to the last second. We just want to know if we really can depend on this behavior.

Our Questions:

  • What is going on here / Why does the BQ caching happen at all?

  • The time this data stays in the BQ cache is 'random' (between 5-20 minutes). What does this mean?

like image 733
Rogier Werschkull Avatar asked Oct 18 '22 03:10

Rogier Werschkull


1 Answers

Thanks for clarifying the question. I think it's an overlook that we didn't disabled caching for partitioned tables with streaming data. It should as otherwise the query might return outdated results.

We invalidate the cache when the table is changed. Streaming into the table will cause the table to be changed. I guess that's why the cache is invalidated between 5 to 20 minutes.

like image 92
Hua Zhang Avatar answered Oct 21 '22 03:10

Hua Zhang