Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Alternative to BigQuery for medium-sized data

This is a follow-up to the question Why doesn't BigQuery perform as well on small data sets.

Let's suppose I have a data-set that is ~1M rows. In the current database that we're using (mysql) aggregation queries would run quite slow, perhaps taking ~10s or so on complex aggregations. On BigQuery, the initialization time required might make this query take ~3 seconds, better than in mysql, but the wrong tool for the job, if we need to return queries in 1s or under.

My question then is, what would be a good alternative to using BigQuery on doing aggregated queries on moderate-sized data-sets, such as 1-10M rows? An example query might be:

SELECT studio, territory, count(*)
FROM mytable
GROUP BY studio, territory
ORDER BY count(*) DESC

Possible solutions I've thought of are ElasticSearch (https://github.com/NLPchina/elasticsearch-sql) and Redshift (postgres is too slow). What would be a good option here that can be queried via SQL?

Note: I'm not looking for why or how BQ should be used, I'm looking for an alternative for data sets under 10M rows where the query can be returned in under ~1s.

like image 994
David542 Avatar asked Mar 10 '17 23:03

David542


People also ask

What is BigQuery not good for?

However, despite its unique advantages and powerful features, BigQuery is not a silver bullet. It is not recommended to use it on data that changes too often and, due to its storage location bound to Google's own services and processing limitations it's best not to use it as a primary data storage.

Is BigQuery used for Big Data?

BigQuery is suitable for “heavy” queries, those that operate using a big set of data. The bigger the dataset, the more you're likely to gain performance by using BigQuery.

How is Snowflake different from BigQuery?

Snowflake allows administrators to scale their compute and storage resources up and down independently. BigQuery is "serverless" — compute and storage resources can scale independently, and all scaling issues are handled automatically.


2 Answers

2020 update: Check out BigQuery BI Engine, the built-in accelerator of queries for dashboards:

  • https://cloud.google.com/bi-engine/docs/overview

If you need answers in less than a second, you need to think about indexing.

Typical story:

  1. MySQL (or any other database proposed here) is fast, until...
  2. One day some of your aggregation queries start running slow. Minutes, hours, days, etc.
  3. Typical solution for step 2 is indexing and pre-aggregating. If you want answers in less than a second for certain type of questions, you'll need to invest time and optimization cycles to answer just that type of questions.
  4. BigQuery's beauty is that you can skip step 3. Bring those minutes/hours/days to seconds, with minimal investment - any query, at any time.

BigQuery is awesome because it gives you 4. But you are asking for 3, MySQL is fine for that, Elasticsearch is fine too, any indexed database will bring you results in less than a second - as long as you invest time on optimizing your system for certain type of question. Then to get answers for any arbitrary question without investing any optimization time, use BigQuery.

BigQuery: Will answer arbitrary questions in seconds, no preparation needed.

MySQL and alternatives: Will answer certain type of questions in less than a second, but it will take development time to get there.

like image 193
Felipe Hoffa Avatar answered Sep 27 '22 17:09

Felipe Hoffa


Here are a few alternatives to consider for data of this size:

  1. Single Redshift small SSD node
    • No setup. Easily returns answers on this much data in under 1s. 
  2. Greenplum on a small T2 instance
    • Postgres-like. Similar perf to Redshift. Not paying for storage you won't need. Start with their single node "sandbox" AMI.
  3. MariaDB Columnstore
    • MySQL-like. Used to be called InfiniDB. Very good performance. Supported by MariaDB (the company).
  4. Apache Drill
    • Drill has a very similar philosophy to BiqQuery but can be used to anywhere (it's just a jar). Queries will be fast on this size data.

If low admin / quick start is critical go with Redshift. If money / flexibility is critical start with Drill. If you prefer MySQL start with MariaDB Columnstore.

like image 20
Joe Harris Avatar answered Sep 27 '22 17:09

Joe Harris