Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

totalBytesBilled is different from totalBytesProcessed

I am using the streak BigQuery developer tool and noticed some wierd behaviour in the "Query Cost". when digging into the details, i found out a weird behavior in the totalBytesBilled and totalBytesProcessed properties. but i had some trouble understanding it...

From the BigQuery resource:

  • statistics.query.totalBytesBilled :Total bytes billed for the job.
  • statistics.query.totalBytesProcessed : Total bytes processed for the job.

The description of these 2 properties is pretty vague...

Based on my past experience I expect these 2 to be the same after I have consumed the free portion of my quota.

A sample query on the sample data set

SELECT word,    word_count 
FROM [publicdata:samples.shakespeare] S
LIMIT 1000

returned:

   "totalBytesProcessed": "2650191",
   "totalBytesBilled": "10485760",
  1. Can someone please give better explanation what are these properties and what is the difference between them?
  2. How come for some (pretty small) queries I get totalBytesBilled significantly higher than totalBytesProcessed?
  3. How are they calculated?
  4. Any tips for optimizing my queries to minimize "totalBytesBilled"
  5. in https://cloud.google.com/bigquery/pricing#on_demand it says: "High Compute Tiers apply for queries that consume extraordinarily large computing resources relative to the amount of bytes scanned. For example, queries that contain a very large number of JOIN or CROSS JOIN clauses, or complex user-defined functions (UDFs) with large processing requirements." Can you please be more specific? How many is "very large number of join clauses"? What makes a UDF "complex"?

Thanks

like image 913
N.N. Avatar asked Sep 17 '15 09:09

N.N.


1 Answers

Query pricing is documented in full here: https://cloud.google.com/bigquery/pricing#on_demand

To give specific answers to your questions:

  1. The totalBytesProcessed field tells you how much data is processed (read) by your query. The totalBytesBilled field tells you how many bytes you were actually billed for. These are usually the same, but may differ in some corner cases (see below) or when you run a "high compute" query (see link above).

  2. There is a 10 MB minimum per query and 10 MB minimum per table referenced, to account for overhead. These minimums (documented above) are the cause of the discrepancy you noted. These charges were previously applied at the time your bill is generated but were not previously reported via this API. With the addition of the totalBytesBilled field, we can now show you these additional billing details. (Note that the actual dollar amounts involved here are incredibly small: at $5/TB, 10 MB is $0.000005. If you ran the daily limit of 100,000 queries at the 10 MB minimum, you'd only pay $5.)

  3. The data size calculation is documented here, and the link above explains how this data size calculation translates to a per-query price.

  4. Generally speaking, only reference the data you care about. Consider using table wildcard functions or table decorators to limit the range of data your queries scan. Note that the LIMIT operator limits the size of the results but not the amount of data scanned/billed!

  5. We can't give specific numbers because there are lots of variables that can affect how computationally intensive a query is. Joins (and in particular cross joins) can be expensive because they can multiply the amount of data processed by the query, which then consumes more resources than we budget for the query. UDFs can be expensive because they can perform large computations (nested loops, complex control flow) for each row. However, a handful of joins whose output is proportional to the input size, or a UDF that performs a modest computation proportional to the input data size, should still fall in tier 1.

    One way to think about this change is that we have a budget for the amount of computational resources that we can throw at a given query based on the totalBytesProcessed. New features like UDFs make it easier for queries to exceed that budget, and we wanted to provide a way for users to pay for High Compute queries rather than just causing their queries to fail.

    If you'd like to plan for this change, you can observe the totalBytesBilled and billingTier fields to see which of your queries will need to be run at a higher tier. If you choose to run your query at a higher tier, see here for details on how to opt in on a per-query or per-project basis.

like image 86
Jeremy Condit Avatar answered Nov 04 '22 04:11

Jeremy Condit