Lately I've been challenged with building a data warehouse to store and process a huge volume of data. The estimated volume is over 7 billion events per day, retained for 7 days, with an average event size of ~0.5-1 KB (see the rough sizing sketch below). We need to process the data to:
Currently I'm evaluating:
So I'm interested in:
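At those numbers the raw footprint is worth a quick back-of-envelope check. Here is a minimal sketch in Python, assuming only the figures stated above (7B events/day, 7-day retention, 0.5-1 KB per event) and ignoring compression, indexes, and replication:

```python
# Rough storage sizing for the stated workload; the constants are the
# assumptions from the question, not measurements.
EVENTS_PER_DAY = 7_000_000_000
RETENTION_DAYS = 7
EVENT_SIZES_BYTES = (512, 1024)  # ~0.5 KB to ~1 KB per event

for size in EVENT_SIZES_BYTES:
    daily_tib = EVENTS_PER_DAY * size / 1024**4  # bytes -> TiB
    total_tib = daily_tib * RETENTION_DAYS
    print(f"{size} B/event: ~{daily_tib:.1f} TiB/day, ~{total_tib:.1f} TiB retained")
```

Under those assumptions that is roughly 3-7 TiB ingested per day and 23-46 TiB of raw data on hand at any time, before any compression or replication overhead.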
Make sure you consider what your company needs and the use cases of your teams. If you're mostly using your data warehouse for machine learning and data science, your needs will be very different than if you want to provide ongoing, ad-hoc analysis or self-service analytics to your entire company.
A typical data warehouse has four main components: a central database, ETL (extract, transform, load) tools, metadata, and access tools.
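To make the ETL piece concrete, here is a minimal, hypothetical sketch in Python; the source file, schema, and transform are invented for illustration, with SQLite standing in for the central database:

```python
import csv
import sqlite3

# Minimal ETL sketch: extract rows from a CSV file, transform them,
# and load them into a table standing in for the central database.
# "events.csv" and the schema below are hypothetical.

def extract(path):
    with open(path, newline="") as f:
        yield from csv.DictReader(f)

def transform(rows):
    for row in rows:
        # Normalize types; a stand-in for real cleansing/enrichment logic.
        yield (row["event_id"], row["user_id"], int(row["size_bytes"]))

def load(records, conn):
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events "
        "(event_id TEXT, user_id TEXT, size_bytes INTEGER)"
    )
    conn.executemany("INSERT INTO events VALUES (?, ?, ?)", records)
    conn.commit()

conn = sqlite3.connect("warehouse.db")
load(transform(extract("events.csv")), conn)
```

The metadata and access-tool components then sit on top of this: metadata describes what landed in the table, and access tools (BI, SQL clients) query it.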
I recently wrote this summary based on Mark Litwintschik's series comparing BigQuery, Spark, Hive, Presto, ElasticSearch, AWS Redshift, AWS EMR, and Google Dataproc:
https://cloud.google.com/blog/big-data/2016/05/bigquery-and-dataproc-shine-in-independent-big-data-platform-comparison
Summary of the summary:
But the best benchmark you can get is your own: trying BigQuery should be fast and easy. Then try to find another platform that loads data as fast, queries it as fast, or comes close to it on price. Mark tried, and those were his findings.
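If you want to try it yourself, a first query takes only a few lines with the official google-cloud-bigquery client; the public dataset and query below are just an example, and you need a GCP project with billing and default credentials set up:

```python
from google.cloud import bigquery  # pip install google-cloud-bigquery

# Run a simple aggregation over one of Google's public datasets and
# print the results; authentication uses your default GCP credentials.
client = bigquery.Client()

query = """
    SELECT name, SUM(number) AS total
    FROM `bigquery-public-data.usa_names.usa_1910_2013`
    GROUP BY name
    ORDER BY total DESC
    LIMIT 10
"""

for row in client.query(query).result():
    print(row["name"], row["total"])
```

Timing a few such queries against your own data, loaded via a batch load job, gives you the load-speed and query-speed numbers to compare against other platforms.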