Google Cloud Bigtable looks fantastic, but I have some questions about backups and redundancy.
Are there any options for backing up data to protect against human errors?
Clusters currently run in a single zone. Are there any ways to mitigate the risk of a zone becoming unavailable?
One way to back up your data that's available today is to run an export MapReduce job, as described here:
https://cloud.google.com/bigtable/docs/exporting-importing#export-bigtable
You are correct that, as of today, a Bigtable cluster's availability is tied to the availability of the zone it runs in. If you need stronger availability, you can look at various methods for replicating your writes (such as Kafka), but be aware that this adds complexity to the system you are building, such as managing consistency between clusters. (What happens if there is a bug in your software and you skip distribution of some writes?)
Using a different system such as Cloud Datastore avoids this problem, as it is not confined to a single zone - but it comes with other tradeoffs to consider.
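To make the "skipped writes" concern concrete, here is a minimal sketch of one way a secondary cluster's applier could notice that some writes never arrived. It assumes the writer stamps every replicated write with a monotonically increasing sequence number starting at 1; the `GapDetector` class and its method names are hypothetical and are not part of any Bigtable or Kafka API.

```python
from typing import List, Set


class GapDetector:
    """Tracks sequence numbers seen by a replica and reports holes.

    Assumption: the primary writer assigns sequence numbers 1, 2, 3, ...
    to every write it distributes. Delivery may be out of order.
    """

    def __init__(self) -> None:
        self.next_expected = 1        # lowest sequence number not yet seen
        self.pending: Set[int] = set()  # numbers seen ahead of next_expected

    def receive(self, seq: int) -> None:
        # Ignore duplicates and anything already accounted for.
        if seq < self.next_expected or seq in self.pending:
            return
        self.pending.add(seq)
        # Advance past every contiguous number we now hold.
        while self.next_expected in self.pending:
            self.pending.remove(self.next_expected)
            self.next_expected += 1

    def missing(self) -> List[int]:
        # Numbers below the highest one seen that have not arrived yet.
        if not self.pending:
            return []
        highest = max(self.pending)
        return [s for s in range(self.next_expected, highest)
                if s not in self.pending]


if __name__ == "__main__":
    detector = GapDetector()
    for seq in (1, 2, 4, 5):
        detector.receive(seq)
    print(detector.missing())
```

A gap that persists beyond the expected replication delay suggests a write was dropped and the clusters have diverged. This only detects the problem; repairing it still requires a source to re-read the missing write from.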
It seems that a replication feature is not available at this stage, so I see the following options, given that read access to the write-ahead log (or whatever Bigtable's transaction log is called) is not provided:
1) In Google We Trust. Rely on their expertise in ensuring availability and recovery. One of the attractions of hosted Bigtable for HBase developers is lower administrative overhead: not having to worry about backups and recovery.
2) Deploy a secondary Bigtable cluster in a different zone and send it a copy of each Mutation asynchronously, with more aggressive write buffering on the client, since low latency is not a priority. You could even deploy a regular HBase cluster instead of a second Bigtable cluster, but the extent to which Google's HBase client and the Apache HBase client can coexist in the same runtime remains to be seen.
3) Copy Mutations to a local file and offload it on a schedule to a Cloud Storage class of your choice: Standard or DRA. Replay the files on recovery.
4) A variation of 3). Stand up a Kafka cluster distributed across multiple availability zones. Implement a producer and send Mutations to Kafka; its throughput should be higher than Bigtable/HBase anyway. Keep track of the offset and, on recovery, replay Mutations by consuming messages from Kafka.
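The file-based and Kafka-based options above share the same log-and-replay idea, which can be sketched roughly as follows. This is only an illustration using the Python standard library: the `MutationLog` class, the `replay` function, and the dictionary shape used for a mutation are all hypothetical, and the `apply` callback stands in for whatever client call would write to the recovered cluster.

```python
import json
import os
from typing import Callable, Dict, Iterator

# Hypothetical mutation shape: {"row": ..., "column": ..., "value": ...}
Mutation = Dict[str, str]


class MutationLog:
    """Append-only local log, one JSON-encoded mutation per line."""

    def __init__(self, path: str) -> None:
        self.path = path

    def append(self, mutation: Mutation) -> None:
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(mutation, sort_keys=True) + "\n")
            f.flush()
            os.fsync(f.fileno())  # make sure the line reaches disk

    def read_from(self, offset: int = 0) -> Iterator[Mutation]:
        # The offset is a line index, analogous to a Kafka offset.
        with open(self.path, "r", encoding="utf-8") as f:
            for index, line in enumerate(f):
                if index >= offset:
                    yield json.loads(line)


def replay(log: MutationLog,
           apply: Callable[[Mutation], None],
           offset: int = 0) -> int:
    """Apply every logged mutation from offset onward.

    Returns the next offset to resume from.
    """
    count = 0
    for mutation in log.read_from(offset):
        apply(mutation)
        count += 1
    return offset + count
```

On recovery, you would call `replay` with the last offset known to be safely stored in the restored cluster. Replaying from an earlier offset is harmless as long as the mutations are idempotent, which plain puts with explicit timestamps would be; increments and appends would not be.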
Another thought... if history is any lesson, AWS didn't have a Multi-AZ option from the very start. It took them a while to evolve.