This is a follow-up to the question "Why doesn't BigQuery perform as well on small data sets".
Let's suppose I have a data set of ~1M rows. In the database we currently use (MySQL), aggregation queries run quite slowly, taking perhaps ~10s on complex aggregations. On BigQuery, the initialization time might make the same query take ~3 seconds: better than MySQL, but still the wrong tool for the job if we need results in 1s or under.
My question then is: what would be a good alternative to BigQuery for running aggregate queries on moderate-sized data sets, such as 1-10M rows? An example query might be:
SELECT studio, territory, count(*)
FROM mytable
GROUP BY studio, territory
ORDER BY count(*) DESC
Possible solutions I've thought of are Elasticsearch (https://github.com/NLPchina/elasticsearch-sql) and Redshift (Postgres is too slow). What would be a good option here that can be queried via SQL?
Note: I'm not looking for why or how BQ should be used, I'm looking for an alternative for data sets under 10M rows where the query can be returned in under ~1s.
However, despite its unique advantages and powerful features, BigQuery is not a silver bullet. It is not recommended for data that changes frequently, and because its storage is bound to Google's own services and subject to processing limitations, it is best not to use it as a primary data store.
BigQuery is suitable for “heavy” queries, those that operate using a big set of data. The bigger the dataset, the more you're likely to gain performance by using BigQuery.
Snowflake allows administrators to scale their compute and storage resources up and down independently. BigQuery is "serverless" — compute and storage resources can scale independently, and all scaling issues are handled automatically.
2020 update: Check out BigQuery BI Engine, the built-in accelerator of queries for dashboards:
If you need answers in less than a second, you need to think about indexing.
Typical story:
BigQuery is awesome because it gives you 4. But you are asking for 3. MySQL is fine for that, Elasticsearch is fine too; any indexed database will return results in less than a second, as long as you invest time in optimizing your system for a certain type of question. To get answers to any arbitrary question without investing any optimization time, use BigQuery.
BigQuery: Will answer arbitrary questions in seconds, no preparation needed.
MySQL and alternatives: Will answer certain types of questions in less than a second, but it will take development time to get there.
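To make the "development time" point concrete, here is a minimal sketch of two common optimizations for the query in the question: a composite index on the grouped columns, and a pre-aggregated summary table. It uses Python's built-in sqlite3 purely as a stand-in so it runs anywhere; the sample rows and the names idx_studio_territory and mytable_counts are made up for illustration, and in MySQL you would still have to measure whether either approach gets you under 1s on your data.

```python
import sqlite3

# SQLite stands in for MySQL here; table and column names follow the
# question's query, the sample rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mytable (id INTEGER PRIMARY KEY, studio TEXT, territory TEXT)"
)
rows = [
    ("Fox", "US"), ("Fox", "US"), ("Fox", "FR"),
    ("Sony", "US"), ("Sony", "US"), ("Sony", "US"),
]
conn.executemany("INSERT INTO mytable (studio, territory) VALUES (?, ?)", rows)

# Option 1: a composite index on the GROUP BY columns, so the engine can
# aggregate from the index instead of scanning and sorting the full rows.
conn.execute("CREATE INDEX idx_studio_territory ON mytable (studio, territory)")

QUERY = """
SELECT studio, territory, COUNT(*) AS n
FROM mytable
GROUP BY studio, territory
ORDER BY n DESC
"""
direct = conn.execute(QUERY).fetchall()

# Option 2: a pre-aggregated summary table (hypothetical name), rebuilt or
# incrementally updated whenever mytable changes. Reads then touch only
# one row per (studio, territory) pair.
conn.execute(
    """
    CREATE TABLE mytable_counts AS
    SELECT studio, territory, COUNT(*) AS n
    FROM mytable
    GROUP BY studio, territory
    """
)
summary = conn.execute(
    "SELECT studio, territory, n FROM mytable_counts ORDER BY n DESC"
).fetchall()

print(direct)
print(summary)
```

Both options return the same rows. The trade-off is that the summary table only answers this one type of question and has to be kept in sync with the base table, which is exactly the preparation work BigQuery lets you skip.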
Here are a few alternatives to consider for data of this size:
If low admin / quick start is critical, go with Redshift. If money / flexibility is critical, start with Drill. If you prefer MySQL, start with MariaDB ColumnStore.