I am using the streak BigQuery developer tool and noticed some wierd behaviour in the "Query Cost". when digging into the details, i found out a weird behavior in the totalBytesBilled and totalBytesProcessed properties. but i had some trouble understanding it...
From the BigQuery resource:
The description of these 2 properties is pretty vague...
Based on my past experience I expect these 2 to be the same after I have consumed the free portion of my quota.
A sample query on the sample data set
SELECT word, word_count
FROM [publicdata:samples.shakespeare] S
LIMIT 1000
returned:
"totalBytesProcessed": "2650191",
"totalBytesBilled": "10485760",
Thanks
Query pricing is documented in full here: https://cloud.google.com/bigquery/pricing#on_demand
To give specific answers to your questions:
The totalBytesProcessed
field tells you how much data is processed (read) by your query. The totalBytesBilled
field tells you how many bytes you were actually billed for. These are usually the same, but may differ in some corner cases (see below) or when you run a "high compute" query (see link above).
There is a 10 MB minimum per query and 10 MB minimum per table referenced, to account for overhead. These minimums (documented above) are the cause of the discrepancy you noted. These charges were previously applied at the time your bill is generated but were not previously reported via this API. With the addition of the totalBytesBilled
field, we can now show you these additional billing details. (Note that the actual dollar amounts involved here are incredibly small: at $5/TB, 10 MB is $0.000005. If you ran the daily limit of 100,000 queries at the 10 MB minimum, you'd only pay $5.)
The data size calculation is documented here, and the link above explains how this data size calculation translates to a per-query price.
Generally speaking, only reference the data you care about. Consider using table wildcard functions or table decorators to limit the range of data your queries scan. Note that the LIMIT
operator limits the size of the results but not the amount of data scanned/billed!
We can't give specific numbers because there are lots of variables that can affect how computationally intensive a query is. Joins (and in particular cross joins) can be expensive because they can multiply the amount of data processed by the query, which then consumes more resources than we budget for the query. UDFs can be expensive because they can perform large computations (nested loops, complex control flow) for each row. However, a handful of joins whose output is proportional to the input size, or a UDF that performs a modest computation proportional to the input data size, should still fall in tier 1.
One way to think about this change is that we have a budget for the amount of computational resources that we can throw at a given query based on the totalBytesProcessed
. New features like UDFs make it easier for queries to exceed that budget, and we wanted to provide a way for users to pay for High Compute queries rather than just causing their queries to fail.
If you'd like to plan for this change, you can observe the totalBytesBilled
and billingTier
fields to see which of your queries will need to be run at a higher tier. If you choose to run your query at a higher tier, see here for details on how to opt in on a per-query or per-project basis.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With