Pros & cons of BigQuery vs. Amazon Redshift [closed]

Comparing Google BigQuery and Amazon Redshift shows that both can meet the same set of requirements and differ mostly in their pricing plans. Redshift seems more complex to configure (defining keys and doing optimization work), whereas Google BigQuery may have issues with joining tables.

Is there a pros & cons list of Google BigQuery vs. Amazon Redshift?

user2339344 asked Oct 13 '14 12:10


2 Answers

I posted this comparison on reddit. Quickly enough, a long-term Redshift practitioner came to comment on my statements. Please see https://www.reddit.com/r/bigdata/comments/3jnam1/whats_your_preference_for_running_jobs_in_the_aws/cur518e for the full conversation.

Sizing your cluster:

  • Redshift will ask you to choose a number of CPUs, RAM, HD, etc. and to turn them on.
  • BigQuery doesn't care. Use it whenever you want, no provisioning needed.

Hourly costs when doing nothing:

  • Redshift will ask you to pay per hour of each of these servers running, even when you are doing nothing.
  • When idle BigQuery only charges you $0.02 per month per GB stored. 2 cents per month per GB, that's it.
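
To make the idle-cost difference concrete, here is a minimal sketch. The $0.02 per GB per month storage figure comes from the bullet above; the node count and the hourly node price are hypothetical placeholders, not actual Redshift prices, so plug in whatever your own quote says.

```python
HOURS_PER_MONTH = 730  # average hours in a month (8760 / 12)


def bigquery_idle_cost(stored_gb, price_per_gb_month=0.02):
    """Monthly storage-only cost, using the $0.02/GB/month figure quoted above."""
    return stored_gb * price_per_gb_month


def cluster_idle_cost(nodes, price_per_node_hour):
    """Monthly cost of an always-on cluster.

    The hourly node price is a hypothetical input, not a real price list entry.
    """
    return nodes * price_per_node_hour * HOURS_PER_MONTH


if __name__ == "__main__":
    # 1 TB stored, versus a hypothetical 2-node cluster at $0.25 per node-hour
    print(f"storage only: ${bigquery_idle_cost(1000):.2f}/month")
    print(f"always-on:    ${cluster_idle_cost(2, 0.25):.2f}/month")
```

This only models the "doing nothing" case; query charges are a separate line on the bill.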

Speed of queries:

  • Redshift performance is limited by the number of CPUs you are paying for.
  • BigQuery transparently brings in as many resources as needed to run your query in seconds.

Indexing:

  • Redshift will ask you to index (correction: distribute) your data under certain criteria, and you'll only be able to run fast queries based on this index.
  • BigQuery has no indexes. Every operation is fast.

Vacuuming:

  • Redshift requires periodic maintenance and 'vacuum' operations that last hours. You are paying for each of these server hours.
  • BigQuery does not. Forget about 'vacuuming'.

Data partitioning and distributing:

  • Redshift requires you to think about how to distribute data within your servers to keep performance up - optimization that works only for certain queries.
  • BigQuery does not. Just run whatever query you want.

Streaming live data:

  • Impossible(?) with Redshift.
  • BigQuery easily handles ingesting up to 100,000 rows per second per table.

Growing your cluster:

  • If you have more data or more concurrent users, scaling up will be painful with Redshift.
  • BigQuery will just work.

Multi zone:

  • You want a multi-zone Redshift for availability and data integrity? Painful.
  • BigQuery is multi-zoned by default.

To try BigQuery you don't need a credit card or any setup time. Just try it (quick instructions to try BigQuery).

When you are ready to put your own data into BigQuery, just copy your newline-delimited JSON logs to Google Cloud Storage and import them.
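
If your logs are not already in that shape, a minimal sketch of producing newline-delimited JSON (one object per line) with the Python standard library could look like this. The record fields and function name are hypothetical, chosen only for illustration.

```python
import json


def to_ndjson(records):
    """Serialize an iterable of dicts as newline-delimited JSON.

    json.dumps escapes any newline inside a string value, so every record
    is guaranteed to occupy exactly one line.
    """
    return "".join(json.dumps(r, separators=(",", ":")) + "\n" for r in records)


if __name__ == "__main__":
    # Hypothetical log records
    logs = [
        {"user": "alice", "action": "login"},
        {"user": "bob", "action": "search", "query": "redshift vs bigquery"},
    ]
    print(to_ndjson(logs), end="")
```

The resulting text can be written to a file, copied to Cloud Storage, and imported from there.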

See this in-depth guide to data warehouse pricing on the cloud: Understanding Cloud Pricing Part 3.2 - More Data Warehouses

Felipe Hoffa answered Oct 21 '22 02:10


Amazon Redshift is a standard SQL database (based on Postgres) with MPP features that allow it to scale. These features also require you to conform your data model somewhat to get the best performance. It supports a large amount of the SQL standard and most tools that can speak to Postgres can use it unchanged.

BigQuery is not a database, in the sense that it doesn't use standard SQL and doesn't provide JDBC/ODBC connectivity. It's a unique service with its own API and interfaces. It provides limited support for SQL queries, but most users interact with it via custom code (Java, Python, etc.). Some 3rd party tools have added support for BigQuery, but existing tools will not work without modification.

tl;dr - Redshift is better for interacting with existing tools and using complex SQL. BigQuery is better for custom-coded interactions and teams who dislike SQL.

UPDATE 2017-04-17 - Here's a much more up-to-date summary of the cost and speed differences (wrapped in a sales pitch, so YMMV). TL;DR - Redshift is usually faster and will be cheaper if you query the data somewhat regularly. http://blog.panoply.io/a-full-comparison-of-redshift-and-bigquery


UPDATE - Since I keep getting downvotes on this (🤷‍♂️), here's an up-to-date response to the items in the other answer:

Sizing your cluster:

  • Redshift allows you to tailor your costs to your usage. If you want the fastest possible queries choose SSD nodes and if you want the lowest possible cost per GB choose HDD nodes. Start small and add nodes whenever you want.

Hourly costs when doing nothing:

  • Redshift keeps your cluster ready for queries, can respond in milliseconds (result cache) and it provides a simple, predictable monthly bill.
  • For example, even if some script accidentally runs 10,000 giant queries over the weekend your Redshift bill will not increase at all.
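
The trade-off behind that point can be sketched as a break-even calculation between a fixed monthly bill and pay-per-query billing based on data scanned. Every number below (the fixed bill, the TB scanned per query, the price per TB) is a hypothetical assumption for illustration, not a quoted price from either vendor.

```python
def on_demand_bill(queries, tb_scanned_per_query, price_per_tb):
    """Monthly bill when each query is charged by the data it scans."""
    return queries * tb_scanned_per_query * price_per_tb


def break_even_queries(fixed_monthly, tb_scanned_per_query, price_per_tb):
    """Queries per month at which per-query billing equals the fixed bill."""
    return fixed_monthly / (tb_scanned_per_query * price_per_tb)


if __name__ == "__main__":
    fixed = 365.0     # hypothetical fixed cluster bill per month
    tb_per_q = 0.1    # hypothetical: each query scans 100 GB
    per_tb = 5.0      # hypothetical on-demand price per TB scanned

    print(f"break-even: {break_even_queries(fixed, tb_per_q, per_tb):.0f} queries/month")
    # A runaway script issuing 10,000 such queries:
    print(f"10,000 queries on demand: ${on_demand_bill(10_000, tb_per_q, per_tb):,.2f}")
    print(f"10,000 queries on fixed cluster: ${fixed:,.2f}")
```

Below the break-even count the per-query model is cheaper under these assumptions; above it the fixed bill wins, and it stays flat however many queries run.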

Speed of queries:

  • Redshift performance is absolutely best in class and gets faster all the time. 3-5x faster in the last 6 months.

Indexing:

  • Redshift has no indexes. It allows you to define sort keys to optimize performance from fast to insanely fast.

Vacuuming:

  • Redshift now automatically runs routine maintenance such as ANALYZE and VACUUM DELETE when your cluster has free resources.

Data partitioning and distributing:

  • Redshift never requires distribution. It allows you to define distribution keys which can make even huge joins very fast.
  • {Ask competitors about join performance…}
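
As an illustration of the sort keys and distribution keys mentioned above, here is a minimal sketch that builds a Redshift-style CREATE TABLE statement with optional DISTKEY and SORTKEY clauses. The helper, table name, and column names are hypothetical, and the helper does no identifier quoting or validation.

```python
def create_table_sql(table, columns, distkey=None, sortkeys=()):
    """Build a CREATE TABLE statement with optional DISTKEY/SORTKEY clauses.

    columns is a list of (name, sql_type) pairs. This is an illustrative
    helper only: it does not escape identifiers.
    """
    cols = ",\n  ".join(f"{name} {sqltype}" for name, sqltype in columns)
    sql = f"CREATE TABLE {table} (\n  {cols}\n)"
    if distkey:
        # Rows with the same distkey value are placed on the same slice,
        # which is what lets joins on that column avoid redistribution.
        sql += f"\nDISTKEY({distkey})"
    if sortkeys:
        # Data is stored sorted by these columns, so range filters on them
        # can skip blocks.
        sql += f"\nSORTKEY({', '.join(sortkeys)})"
    return sql + ";"


if __name__ == "__main__":
    print(create_table_sql(
        "events",  # hypothetical table
        [("user_id", "BIGINT"), ("event_time", "TIMESTAMP"), ("action", "VARCHAR(64)")],
        distkey="user_id",
        sortkeys=("event_time",),
    ))
```

Leaving both arguments out yields a plain CREATE TABLE, which matches the point that the keys are optional tuning rather than a requirement.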

Streaming live data:

  • Redshift has two choices:
    • Stream real-time data into Redshift using Amazon Kinesis Firehose.
    • Skip ingestion altogether by querying your real-time data on S3 as soon as it lands (and at high speeds) using Redshift Spectrum external tables.

Growing your cluster:

  • Redshift can elastically resize most clusters in a few minutes.

Multi zone:

  • Redshift seamlessly replaces any failed hardware and continuously backs up your data, including across regions if desired.

Joe Harris answered Oct 21 '22 03:10