We have been running BigQuery for about 4 months now, and we have some interesting observations and questions regarding pricing:
- The pricing model is not well explained; it needs documentation for joins, unions, and so on.
- If I run a join on a small table, will my data processed increase for each copy of the small table that is multiplied inside the BigQuery engine?
- If I reuse the same sub-query inside a query, is this new data processing even though it is the same sub-query? Can this be optimized in any way? Doesn't BigQuery cache its data?
- Is there a minimum size that data processing is rounded up to? Our observations suggest it is 4 MB; is this correct?
- In the Google Console there is a problem with geo locations: it says that most of the queries are run from CA, which is incorrect, since my company is in Europe and all development is done there. I checked all IP addresses and there is no CA IP. Is it possible to get a more detailed report?
- Regarding 5: we would like to get a history of queries and the price of each one. How can we do this, or must our software track this itself?
- Regarding 3: I have a huge problem with data processing (this is why I asked all these questions). My entire dataset is about 150 MB, yet my data-processed size per query is 3 times larger. Even if BigQuery read my entire dataset, it would not reach the sizes shown as data processed. In which cases is data re-read, causing new data processing, and how do you count this?
Thank you.
Regards,
Tomislav
- The pricing is pretty simple for joins, unions, etc.: whatever source data needs to be read gets charged exactly once. If you do a self-join, for example, you still only get charged for a single read of the data (see the dry-run sketch below).
- See #1. Regardless of how BigQuery works internally, you should only get charged once.
- BigQuery does cache query results on a best-effort basis. If you re-run a query, the cache is checked; on a cache hit, you will not get charged for re-running the query. Note that only full query results are cached: if you modify the query or use it as a sub-query, the cache will not be used (see the cache-hit sketch below).
- BigQuery rounds up to 1 MB increments, with a 10 MB minimum (this rounding is applied in the worked example below).
- I wasn't aware of any geo-location information in the Google Console. Where do you see this information?
- You can get a history of queries and the amount of data processed by each one via the jobs.list API (see the sketch below).
- Can you send a job ID of a query that you ran that was larger than expected? The calculation is (a worked example follows below):
  8 bytes * (# of rows of integer and float columns read) + 2 bytes * (total length of all strings in string columns read)
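To make that arithmetic concrete, here is a rough sketch in Python. The row count, column count, and string lengths are made-up illustration values, and I'm assuming 1 MB = 2^20 bytes when applying the rounding rule from the earlier bullet (1 MB increments, 10 MB minimum):

```python
MB = 1 << 20  # assuming 1 MB = 2**20 bytes

def raw_bytes(num_rows, num_numeric_cols, total_string_chars):
    # 8 bytes per value read from integer/float columns,
    # plus 2 bytes per character read from string columns
    return 8 * num_rows * num_numeric_cols + 2 * total_string_chars

def billed_bytes(raw):
    # round up to the next 1 MB increment, with a 10 MB minimum
    rounded = -(-raw // MB) * MB  # ceiling division
    return max(rounded, 10 * MB)

# Illustration: 1,000,000 rows, 2 numeric columns read,
# one string column averaging 20 characters per row.
raw = raw_bytes(1_000_000, 2, 1_000_000 * 20)
print(raw)                # 56000000
print(billed_bytes(raw))  # 56623104 (rounded up to 54 MB)
```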
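You can also check what a query will be charged before running it. A minimal sketch using the google-cloud-bigquery Python client (the table is just a public sample; a dry run is validated and priced but never executed or billed):

```python
from google.cloud import bigquery

client = bigquery.Client()  # assumes default project and credentials

config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
job = client.query(
    """
    SELECT a.word
    FROM `bigquery-public-data.samples.shakespeare` a
    JOIN `bigquery-public-data.samples.shakespeare` b
      ON a.word = b.word
    """,
    job_config=config,
)
# The table appears twice in this self-join, but the source read
# should only be counted once, per #1 above.
print("would process", job.total_bytes_processed, "bytes")
```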
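To see the result cache from #3 in action, something along these lines (same client, same hedges):

```python
from google.cloud import bigquery

client = bigquery.Client()
sql = "SELECT COUNT(*) FROM `bigquery-public-data.samples.shakespeare`"

first = client.query(sql)
first.result()  # wait for completion so the result lands in the cache

second = client.query(sql)
second.result()
print("cache hit:", second.cache_hit)  # True -> this run was not billed
```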
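And for the query history from #6, jobs.list is exposed as list_jobs in the Python client; each query job carries its bytes processed and bytes billed, which you can multiply by your on-demand rate to price each query:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Most recent jobs first; all_users=True covers the whole project.
for job in client.list_jobs(max_results=50, all_users=True):
    if job.job_type == "query":
        print(job.job_id, job.state,
              job.total_bytes_processed, job.total_bytes_billed)
```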
Let me know if this answers your questions.