
Google BigQuery pricing

I'm a PhD student from Singapore Management University, currently working at Carnegie Mellon University on a research project that needs the historical events from GitHub Archive (http://www.githubarchive.org/). I noticed that Google BigQuery has the GitHub Archive data, so I ran a program to crawl data using the Google BigQuery service.

I just found out that the price Google BigQuery shows on the console is not updated in real time... After I had been running the program for a few hours, the fee was only a little over $4, so I thought the price was reasonable and I kept the program running. After 1~2 days, I checked the price again on Sep 13, 2013, and it had become $1,388... I therefore immediately stopped using the Google BigQuery service. And just now I checked the price again; it turns out I need to pay $4,179...

It is my fault that I didn't realize I would need to pay such a large amount of money for executing queries and obtaining data from Google BigQuery.

This project is only for research, not for commercial purposes. I would like to know whether it is possible to waive the fee. I really need the [Google BigQuery team]'s kind help.

Thank you very much & Best Regards, Lisa

asked Sep 16 '13 by dodoro

People also ask

Is Google BigQuery free?

BigQuery's free tier has two components: one for storage (10 GB) and one for analysis (1 TB/month). If you keep your usage under those limits, you'll never be charged.

Is BigQuery cheap?

There are two components to BigQuery pricing: storage and queries. BigQuery's storage charges are incredibly cheap. It costs two cents per gigabyte per month, which is the same price as Cloud Storage Standard.

Is Google BigQuery open source?

As with all other Google Cloud public datasets, users can query up to 1 TB/month and store up to 10 GB/month at no charge through BigQuery's free tier.


1 Answer

Update, a year later:

Please note some big developments since this situation:

  • Query prices are down 85%.
  • GitHub Archive now publishes daily and yearly tables, so while developing your queries, always test them on these smaller datasets (see the sketch below).
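
For instance, with the current Python client you might point a development query at a single day's table. A minimal sketch, assuming a day-sharded layout named githubarchive.day.YYYYMMDD (that naming is an assumption; check githubarchive.org for the exact dataset names):

from google.cloud import bigquery

client = bigquery.Client()
# Assumed day-sharded table name (githubarchive.day.YYYYMMDD); verify the
# exact dataset layout on githubarchive.org before relying on it.
sql = """
    SELECT type, COUNT(*) AS events
    FROM `githubarchive.day.20150101`
    GROUP BY type
    ORDER BY events DESC
"""
for row in client.query(sql).result():
    print(row.type, row.events)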

BigQuery pricing is based on the amount of data queried. One of its highlights is how easily it scales, going from scanning a few gigabytes to terabytes in seconds.

Pricing that scales linearly is a feature: most (or all?) other databases I know of would require exponentially more expensive resources to handle these amounts of data, or simply couldn't handle them at all - at least not in a reasonable time frame.

That said, linear scaling means that a query over a terabyte is 1,000 times more expensive than a query over a gigabyte. BigQuery users need to be aware of this and plan accordingly. For this purpose BigQuery offers the "dry run" flag, which shows exactly how much data a query will scan before you run it - so you can adjust accordingly.
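
With today's Python client, a dry run looks roughly like this (a sketch; this client library postdates the original answer, and the timeline table reference is the one from this answer, which may no longer exist):

from google.cloud import bigquery

client = bigquery.Client()
# dry_run=True asks BigQuery to estimate how many bytes the query would
# scan, without actually running it and without charging for it.
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
job = client.query(
    "SELECT repository_url FROM `githubarchive.github.timeline`",
    job_config=job_config,
)
print(f"Would scan {job.total_bytes_processed / 1e9:.2f} GB")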

In this case WeiGong was querying a 105 GB table. Just ten SELECT * LIMIT 10 queries will amount to over a terabyte of data scanned (10 × 105 GB ≈ 1 TB - LIMIT does not reduce the amount of data a query reads), and so on.

There are ways to make these same queries consume much less data:

  • Instead of querying SELECT * LIMIT 10, select only the columns you are looking for. BigQuery charges based on the columns you scan, so unnecessary columns add unnecessary cost.

For example, SELECT * queries 105 GB, while the following query only goes through 8.72 GB, making it more than ten times less expensive:

SELECT repository_url, repository_name, payload_ref_type, payload_pull_request_deletions
FROM [githubarchive:github.timeline]

  • Instead of "SELECT *" use tabledata.list when looking to download the whole table. It's free.

  • The GitHub Archive table contains data for all time. Partition it if you only want to look at one month of data.

For example, extracting all of the January data with a query leaves a new table of only 91.7 MB. Querying this table is a thousand times less expensive than querying the big one!

SELECT *
FROM [githubarchive:github.timeline]
WHERE created_at >= '2014-01-01' AND created_at < '2014-02-01'
-- save the result into a new table 'timeline_201401'
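
To actually materialize that result as a table, one option with the modern Python client is a destination-table query. A sketch, where the destination project and dataset are placeholders and the legacy timeline table may since have been replaced:

from google.cloud import bigquery

client = bigquery.Client()
# "my-project.my_dataset" is a placeholder; write to a dataset you own.
job_config = bigquery.QueryJobConfig(
    destination="my-project.my_dataset.timeline_201401",
    write_disposition="WRITE_TRUNCATE",  # overwrite the table if it exists
)
sql = """
    SELECT *
    FROM `githubarchive.github.timeline`
    WHERE created_at >= '2014-01-01' AND created_at < '2014-02-01'
"""
client.query(sql, job_config=job_config).result()  # wait for completion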

Combining these methods you can go from a $4,000 bill to a $4 one, for the same amount of quick and insightful results.

(I'm working with GitHub Archive's owner to get them to store monthly data instead of one monolithic table, to make this even easier.)

answered Sep 28 '22 by Felipe Hoffa