Choosing big data warehouse

Tags:

google-bigquery

amazon-redshift

bigdata

cloudera

cassandra-2.0

I've been tasked with building a data warehouse to store and process a huge amount of data. The estimated volume is over 7 billion events per day. The data should be kept for 7 days. The average event size is ~0.5-1 KB. We need to process the data to:

generate reports;
train models.

Currently I'm evaluating:

Google BigQuery
Redshift
Stratio + Cassandra + AWS + EMR + EBS
Cloudera + AWS

So I'm interested in:

the solution you use inside your company (frameworks, setup, database, number of nodes, etc.)
any real cost examples or comparisons, if possible
management complexity (DevOps)
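
For a rough sense of scale, the figures in the question can be turned into an average ingest rate and a raw storage footprint. This is only a back-of-the-envelope sketch: it assumes decimal units (1 KB = 1000 bytes), a uniform arrival rate, and no compression, replication, or indexes, none of which are stated in the question. The function and variable names are illustrative.

```python
# Back-of-the-envelope sizing from the figures in the question.
# Assumptions (not from the question): decimal units (1 KB = 1000 bytes),
# uniform arrival rate, no compression, replication, or indexes.

EVENTS_PER_DAY = 7_000_000_000
RETENTION_DAYS = 7
EVENT_SIZE_BYTES = (500, 1000)  # ~0.5-1 KB per event
SECONDS_PER_DAY = 86_400
TB = 10**12


def sizing(events_per_day, retention_days, event_size_bytes):
    """Return average events/s plus daily and retained raw volume in TB."""
    daily_tb = events_per_day * event_size_bytes / TB
    return {
        "events_per_second": events_per_day / SECONDS_PER_DAY,
        "daily_tb": daily_tb,
        "retained_tb": daily_tb * retention_days,
    }


# One estimate for the low end of the event size range, one for the high end.
low, high = (sizing(EVENTS_PER_DAY, RETENTION_DAYS, s) for s in EVENT_SIZE_BYTES)

if __name__ == "__main__":
    print(f"average ingest rate: {low['events_per_second']:,.0f} events/s")
    print(f"raw volume per day: {low['daily_tb']:.1f}-{high['daily_tb']:.1f} TB")
    print(f"raw volume retained: {low['retained_tb']:.1f}-{high['retained_tb']:.1f} TB")
```

Under those assumptions this works out to roughly 81,000 events per second on average, 3.5-7 TB of raw data per day, and 24.5-49 TB held at any one time. Peak rates and on-disk sizes in a real system will differ.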

asked May 24 '16 11:05

Yuli Reiri

1 Answer

I recently wrote this summary based on Mark Lit's series comparing BigQuery, Spark, Hive, Presto, ElasticSearch, AWS Redshift, AWS EMR, and Google Dataproc:

https://cloud.google.com/blog/big-data/2016/05/bigquery-and-dataproc-shine-in-independent-big-data-platform-comparison

Summary of the summary:

Same dataset (1 billion rows), same queries, many technologies and configurations.
BigQuery was the fastest to run queries: 2 seconds.
BigQuery was the only one that was fast by default: no optimizations or data pre-processing were needed. The 1 billion rows loaded in 25 minutes, and the data was then ready to query.
Other solutions took hours to load the data (at a significant cost) and were many times slower than BigQuery.

But the best benchmark you can get is your own. Trying BigQuery should be fast and easy; then try to find another platform that loads data as fast, queries it as fast, or comes close on price. Mark tried, and those were his findings.
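
One way to run that kind of comparison yourself is a small timing harness that issues the same query against each platform several times. The sketch below is only an illustration and is not taken from the benchmark series: `benchmark` and `fake_query` are hypothetical names, and `fake_query` is a local stand-in for a function that would submit your SQL through a platform's client and block until the rows are back.

```python
import statistics
import time


def benchmark(run_query, repeats=5):
    """Call run_query() `repeats` times; return (median, best) wall-clock seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()  # should block until the result is fully available
        timings.append(time.perf_counter() - start)
    return statistics.median(timings), min(timings)


def fake_query():
    # Hypothetical stand-in for a real warehouse call; replace it with a
    # function that runs the same SQL on the platform under test.
    sum(range(100_000))


if __name__ == "__main__":
    median_s, best_s = benchmark(fake_query, repeats=3)
    print(f"median {median_s:.4f}s, best {best_s:.4f}s")
```

Reporting the median alongside the best run makes one-off slow or fast runs easier to spot. Some platforms cache the results of repeated identical queries, so it is worth disabling the cache or checking how the repeated runs were served before comparing numbers.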

answered Sep 30 '22 08:09

Felipe Hoffa
