
Choosing a big data warehouse

I've recently been tasked with building a data warehouse to store and process a huge amount of data. The estimated volume is over 7 billion events per day, and the data should be kept for 7 days. The average event size is ~0.5-1 KB. We need to process the data to:

  • generate reports;
  • train models.
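For a rough sense of scale, the figures above can be turned into a back-of-the-envelope estimate. This is only a sketch: it assumes 1 KB = 1000 bytes, uncompressed raw events, a perfectly even arrival rate, and no replication or indexing overhead, none of which is stated in the question.

```python
# Rough sizing from the figures in the question.
# Assumptions (not from the question): 1 KB = 1000 bytes, raw uncompressed
# events, uniform arrival rate, no replication or index overhead.
EVENTS_PER_DAY = 7_000_000_000
RETENTION_DAYS = 7
EVENT_SIZE_BYTES = (500, 1_000)  # ~0.5-1 KB per event
SECONDS_PER_DAY = 24 * 60 * 60
BYTES_PER_TB = 1_000_000_000_000


def sizing(event_size_bytes):
    """Return daily and retained raw volume in TB for one event size."""
    daily_tb = EVENTS_PER_DAY * event_size_bytes / BYTES_PER_TB
    return {"daily_tb": daily_tb, "retained_tb": daily_tb * RETENTION_DAYS}


low, high = (sizing(size) for size in EVENT_SIZE_BYTES)
avg_events_per_sec = EVENTS_PER_DAY / SECONDS_PER_DAY

print(f"Daily raw volume:    {low['daily_tb']}-{high['daily_tb']} TB")
print(f"Retained (7 days):   {low['retained_tb']}-{high['retained_tb']} TB")
print(f"Average ingest rate: {avg_events_per_sec:,.0f} events/s")
```

Under these assumptions the raw data comes to roughly 3.5-7 TB per day and 24.5-49 TB over the retention window, with an average ingest rate of about 81,000 events per second. Real peaks will be higher than the average.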

Currently I'm evaluating:

  • Google BigQuery
  • Redshift
  • Stratio + Cassandra + AWS + EMR + EBS
  • Cloudera + AWS

So I'm interested in:

  • the solution you use inside your company (frameworks, setup, database, number of nodes, etc.)
  • any real cost examples/comparisons, if possible
  • management complexity (DevOps)
Yuli Reiri asked May 24 '16 11:05




1 Answer

I recently wrote this summary based on Mark Lit's series comparing BigQuery, Spark, Hive, Presto, ElasticSearch, AWS Redshift, AWS EMR, and Google Dataproc:

https://cloud.google.com/blog/big-data/2016/05/bigquery-and-dataproc-shine-in-independent-big-data-platform-comparison

Summary of the summary:

  • Same dataset (1 billion rows), same queries, many technologies and configurations.
  • BigQuery was the fastest to run queries: 2 seconds.
  • BigQuery was the only one that was fast by default: no optimizations or data pre-processing were required. 1 billion rows were loaded in 25 minutes, and the data was then ready to be queried.
  • Other solutions took hours to load data (at a significant cost), and were many times slower than BigQuery.

But the best benchmark you can get is your own: Trying BigQuery should be fast and easy. Then try to find another platform that loads data as fast, queries it as fast, or gets close to it in price. Mark tried, and those were his findings.
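As a starting point for such a comparison, the load figure quoted above (1 billion rows in 25 minutes) can be set against the 7 billion events per day from the question. This is a naive sketch: it assumes load time scales linearly with row count and that the benchmark rows are comparable in size to the events in question, neither of which the benchmark establishes.

```python
# Naive extrapolation of the quoted load figure to the question's volume.
# Assumptions (not from the benchmark): load time scales linearly with row
# count, and benchmark rows are comparable in size to the events.
ROWS_LOADED = 1_000_000_000       # rows loaded in the benchmark
LOAD_MINUTES = 25                 # reported load time
EVENTS_PER_DAY = 7_000_000_000    # daily volume from the question
SECONDS_PER_DAY = 24 * 60 * 60

# Throughput observed in the benchmark load.
rows_per_sec = ROWS_LOADED / (LOAD_MINUTES * 60)

# Average throughput needed just to keep up with incoming events.
required_rows_per_sec = EVENTS_PER_DAY / SECONDS_PER_DAY

# Time to load one full day of events at the benchmark rate.
minutes_for_daily_volume = EVENTS_PER_DAY / rows_per_sec / 60

print(f"Benchmark load rate:   {rows_per_sec:,.0f} rows/s")
print(f"Required average rate: {required_rows_per_sec:,.0f} rows/s")
print(f"One day of events:     {minutes_for_daily_volume:.0f} minutes to load")
```

On these assumptions, one day of events would take about 175 minutes to load at the benchmark rate, which is why measuring load speed on your own data matters before committing to a platform.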

Felipe Hoffa answered Sep 30 '22 08:09