
How to handle BigTable Scan InvalidChunk exceptions?

I am trying to scan BigTable data in which some rows are 'dirty', but the scan fails depending on what is scanned, raising (serialization?) InvalidChunk exceptions. The code is as follows:

from google.cloud import bigtable
from google.cloud import happybase

# project_id, instance_id and table_name are placeholders for the real values.
client = bigtable.Client(project=project_id, admin=True)
instance = client.instance(instance_id)
connection = happybase.Connection(instance=instance)
table = connection.table(table_name)

for key, row in table.scan(limit=5000):  # BOOM!
    pass

Leaving out some columns, limiting the number of rows, or specifying start and stop keys allows the scan to succeed. I cannot tell from the stack trace which values are problematic - it varies across columns - the scan just fails. This makes it hard to clean the data at the source.
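
For illustration, a narrowed scan along these lines succeeds (the key range and column name below are hypothetical placeholders, not values from my real table):

for key, row in table.scan(row_start=b'known-good-start-key',
                           row_stop=b'known-good-stop-key',
                           columns=['cf1:col1'],  # hypothetical family:qualifier
                           limit=100):
    pass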

When I step through with the Python debugger, I see that the chunk (which is of type google.bigtable.v2.bigtable_pb2.CellChunk) has no value (it is NULL / undefined):

ipdb> pp chunk.value
b''
ipdb> chunk.value_size
0

I can confirm this with the HBase shell, using the row key I got from self._row.row_key.

So the question becomes: how can a BigTable scan filter out columns that have an undefined / empty / null value?
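
One direction I am considering is a server-side value filter on the lower-level API. This is only a sketch - I have not confirmed it avoids the exception, and it assumes the RE2 pattern b'.+' matches only cells whose value is non-empty:

from google.cloud.bigtable.row_filters import ValueRegexFilter

bt_table = instance.table(table_name)  # lower-level table, not the happybase wrapper

# Only return cells with at least one byte of value (assumption: the empty
# cells are what trigger InvalidChunk).
partial_rows = bt_table.read_rows(limit=5000, filter_=ValueRegexFilter(b'.+'))
partial_rows.consume_all()
for row_key, row_data in partial_rows.rows.items():
    pass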

I get the same problem from both Google Cloud APIs that return generators, which internally stream data as chunks over gRPC:

  • google.cloud.happybase.table.Table#scan()
  • google.cloud.bigtable.table.Table#read_rows().consume_all()

The abbreviated stack trace is as follows:

---------------------------------------------------------------------------
InvalidChunk                              Traceback (most recent call last)
<ipython-input-48-922c8127f43b> in <module>()
      1 row_gen = table.scan(limit=n) 
      2 rows = []
----> 3 for kvp in row_gen:
      4     pass
.../site-packages/google/cloud/happybase/table.py in scan(self, row_start, row_stop, row_prefix, columns, timestamp, include_timestamp, limit, **kwargs)
    391         while True:
    392             try:
--> 393                 partial_rows_data.consume_next()
    394                 for row_key in sorted(rows_dict):
    395                     curr_row_data = rows_dict.pop(row_key)

.../site-packages/google/cloud/bigtable/row_data.py in consume_next(self)
    273         for chunk in response.chunks:
    274 
--> 275             self._validate_chunk(chunk)
    276 
    277             if chunk.reset_row:

.../site-packages/google/cloud/bigtable/row_data.py in _validate_chunk(self, chunk)
    388             self._validate_chunk_new_row(chunk)
    389         if self.state == self.ROW_IN_PROGRESS:
--> 390             self._validate_chunk_row_in_progress(chunk)
    391         if self.state == self.CELL_IN_PROGRESS:
    392             self._validate_chunk_cell_in_progress(chunk)

.../site-packages/google/cloud/bigtable/row_data.py in _validate_chunk_row_in_progress(self, chunk)
    368         self._validate_chunk_status(chunk)
    369         if not chunk.HasField('commit_row') and not chunk.reset_row:
--> 370             _raise_if(not chunk.timestamp_micros or not chunk.value)
    371         _raise_if(chunk.row_key and
    372                   chunk.row_key != self._row.row_key)

.../site-packages/google/cloud/bigtable/row_data.py in _raise_if(predicate, *args)
    439     """Helper for validation methods."""
    440     if predicate:
--> 441         raise InvalidChunk(*args)

InvalidChunk: 

Can you show me how to scan BigTable from Python, ignoring / logging dirty rows that raise InvalidChunk? (try ... except won't work around the generator, which is inside the Google Cloud API's row_data PartialRowsData class.)

Also, can you show me code to chunk-stream a table scan in BigTable? The HappyBase batch_size and scan_batching options don't seem to be supported.
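
For reference, the kind of workaround I have in mind is sketched below: a paging loop that starts a fresh scan for each batch instead of trying to resume the broken generator, and logs the key at which a batch blows up. paged_scan and its skip-on-error behaviour are my own invention, not part of either API, and it simply stops at the first dirty batch rather than recovering past it:

import logging

from google.cloud.bigtable.row_data import InvalidChunk


def paged_scan(table, batch_size=1000):
    """Yield (key, row) pairs in batches, starting a fresh scan per batch."""
    row_start = None
    while True:
        last_key = None
        yielded = 0
        try:
            for key, row in table.scan(row_start=row_start, limit=batch_size):
                if key == row_start:   # row_start is inclusive; skip the boundary row
                    continue
                yield key, row
                last_key = key
                yielded += 1
        except InvalidChunk:
            # The generator cannot be resumed; log where it broke and give up
            # on this batch (assumption: logging and stopping is acceptable).
            logging.warning('InvalidChunk after key %r', last_key)
        if yielded == 0:               # nothing new: end of table or a dirty first row
            return
        row_start = last_key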

Asked Nov 24 '16 by Josh Reuben


1 Answer

This was likely due to this bug: https://github.com/googleapis/google-cloud-python/issues/2980

The bug has been fixed, so this should no longer be an issue.

Answered Sep 21 '22 by Ramesh Dharan