 

Bigtable vs BigQuery use case for time-series data

I am looking to decide between Bigtable and BigQuery for my use case of time-series data.

I have gone through https://cloud.google.com/bigtable/docs/schema-design-time-series

This is for storing Omniture data, which contains information like a website visitor key (a Long key), the visitor's cookie id (a Long key), and time-series web-hit data for each IP and cookie.

What can be used as the row key for Bigtable? From the best practices, I learned that I cannot use the timestamp or the cookie id as a prefix; the key should instead start with an identifier (preferably alphabetic?) followed by a time-series suffix. The data has a volume of 500 million rows with 52 columns, stored in a SQL table today. The data might be updated by OLTP processing, but the table would later be queried on the time-series data for OLAP-style processing.

a) Would Bigtable be the best option here, or should I use BigQuery, since the later querying would all be based on the time-series data?

b) If using Bigtable, what would be the best row key, since the timestamp is the only meaningful filter I see for my data? I believe using other fields in the table, like the visitor key or cookie id (Long ids), as a prefix before the timestamp would still cause the entire dataset to fill up one node in Bigtable instead of being distributed.

Please let me know.

asked Sep 18 '18 by Roshan Fernando

1 Answer

(I'm an engineer on the Cloud Bigtable Team)

As you've discovered from our docs, the row key format is the biggest decision you make when using Bigtable, as it determines which access patterns can be performed efficiently. Using visitorKey + cookie as a prefix before the timestamp sounds to me like it would avoid hotspotting issues, as there are almost certainly many more visitors to your site than there would be nodes in your cluster. Bigtable serves these sorts of time-series use cases all the time!
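For concreteness, here is a minimal sketch of one way to lay out such a key. It's in Python, and the field widths and `#` separator are my own assumptions rather than anything prescribed:

```python
# Illustrative row-key scheme: prefix with the high-cardinality
# visitor/cookie ids so writes spread across nodes, and suffix with the
# timestamp so each visitor's hits stay contiguous and in time order.

def make_row_key(visitor_key: int, cookie_id: int, ts_millis: int) -> bytes:
    # Zero-padded decimal keeps lexicographic order equal to numeric order.
    return f"{visitor_key:020d}#{cookie_id:020d}#{ts_millis:013d}".encode()

# All hits for one visitor/cookie sort together by time, so a prefix scan
# on "visitor#cookie#" retrieves that visitor's full time series in order.
key = make_row_key(visitor_key=123456789, cookie_id=987654321,
                   ts_millis=1537293600000)
```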

However, you're also coming from a SQL architecture, which isn't always a good fit for Bigtable's schema/query model. So here are some questions to get you started:

  • Are you planning to perform lots of ad hoc queries like "SELECT A FROM Bigtable WHERE B=x"? If so, strongly prefer BigQuery. Bigtable can't support this query without performing a full table scan. And in general Bigtable is geared more towards streaming back a simple subset of the data quickly, say, to a Dataflow job, rather than embedding complex processing in the queries themselves.
  • Will you require multi-row OLTP transactions? Again, use BigQuery, as Bigtable only supports transactions within a single row.
  • Are you streaming in new events at high QPS? Bigtable is much better for these sorts of high-volume updates; a minimal write sketch follows this list. Remember that Bigtable's original purpose was as a random access sink for web crawler updates in Google's search index!
  • Do you want to perform any sort of large-scale complex transformations on the data? Again, Bigtable is likely better here, as you can stream data out and back in faster and let custom business logic in a Dataflow job do whatever you want to it.
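To make the streaming-ingest point concrete, here is a rough sketch using the Python client (google-cloud-bigtable). The project, instance, table, and column-family names are hypothetical placeholders, as is the key layout:

```python
import datetime
from google.cloud import bigtable

# Hedged sketch of the high-QPS ingest path; all resource names are
# placeholders, not anything from the question.
client = bigtable.Client(project="my-project")
table = client.instance("my-instance").table("web-hits")

def write_hit(visitor_key: int, cookie_id: int, ts_millis: int, url: str) -> None:
    # Same key shape as sketched above: ids first, timestamp last.
    row_key = f"{visitor_key:020d}#{cookie_id:020d}#{ts_millis:013d}".encode()
    row = table.direct_row(row_key)
    # One cell per tracked field; the 52 SQL columns would map to 52
    # qualifiers (or several column families) on the same row.
    row.set_cell("hits", b"url", url.encode(),
                 timestamp=datetime.datetime.utcfromtimestamp(ts_millis / 1000))
    row.commit()  # single-row commits are atomic; use table.mutate_rows() to batch
```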

You can also combine the two services if you need some combination of these features. For example, say you're receiving high-volume updates all the time, but want to be able to perform complex ad hoc queries. If you're alright working with a slightly delayed version of the data, it could make sense to write the updates to Bigtable, then periodically scan the table using Dataflow and export a post-processed version of the latest events into BigQuery. GCP also allows BigQuery to serve queries directly from Bigtable in some regions: https://cloud.google.com/bigquery/external-data-bigtable
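As a rough illustration of the query side of that hybrid setup, assuming an external table has already been defined over the Bigtable table per the linked docs (the project, dataset, and table names here are placeholders):

```python
from google.cloud import bigquery

# Hedged sketch: the external table must already exist, defined over the
# Bigtable table as described in the linked docs. The Bigtable row key is
# exposed to BigQuery as the `rowkey` column.
client = bigquery.Client(project="my-project")
job = client.query(
    "SELECT rowkey FROM `my-project.analytics.web_hits_ext` LIMIT 10"
)
for row in job.result():
    print(row.rowkey)
```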

answered Sep 19 '22 by Douglas McErlean