I am trying to scan BigTable data where some rows are 'dirty' - but depending on the scan this fails, causing (serialization?) InvalidChunk exceptions. The code is as follows:
from google.cloud import bigtable
from google.cloud import happybase
client = bigtable.Client(project=project_id, admin=True)
instance = client.instance(instance_id)
connection = happybase.Connection(instance=instance)
table = connection.table(table_name)
for key, row in table.scan(limit=5000): #BOOM!
pass
Leaving out some columns, limiting the scan to fewer rows, or specifying the start and stop keys allows the scan to succeed. I cannot tell from the stack trace which values are problematic - it varies across columns - the scan just fails. This makes it difficult to clean the data at the source.
When I use the Python debugger, I see that the chunk (which is of type google.bigtable.v2.bigtable_pb2.CellChunk) has no value (it is NULL / undefined):
ipdb> pp chunk.value
b''
ipdb> chunk.value_size
0
I can confirm this with the HBase shell, using the row key I got from self._row.row_key.
So the question becomes: how can a BigTable scan filter out columns that have an undefined / empty / null value?
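One possible server-side workaround (a hedged sketch, not part of the happybase API) is to scan with the native google.cloud.bigtable client and attach a value filter, so that empty cells never reach the client. It assumes the same project_id / instance_id / table_name as above, and uses the RE2 escape \C ("any byte") so that a cell must contain at least one byte to be returned:

from google.cloud import bigtable
from google.cloud.bigtable import row_filters

client = bigtable.Client(project=project_id, admin=True)
instance = client.instance(instance_id)
table = instance.table(table_name)

# ValueRegexFilter applies RE2 to the raw cell bytes; '\C+' means
# "one or more bytes of any kind", so cells whose value is b'' are dropped.
non_empty = row_filters.ValueRegexFilter(b'\\C+')

partial_rows = table.read_rows(limit=5000, filter_=non_empty)
partial_rows.consume_all()  # newer client versions let you iterate read_rows() directly
for row_key, row_data in partial_rows.rows.items():
    pass  # row_data.cells maps column family -> qualifier -> [Cell, ...]

Note that this drops individual empty cells rather than whole rows; skipping every row that contains an empty cell would need a conditional row filter, which is beyond this sketch.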
I get the same problem from both Google Cloud Python APIs (google.cloud.bigtable and google.cloud.happybase), which return generators that internally stream data as chunks over gRPC.
The abbreviated stack trace is as follows:
---------------------------------------------------------------------------
InvalidChunk Traceback (most recent call last)
<ipython-input-48-922c8127f43b> in <module>()
1 row_gen = table.scan(limit=n)
2 rows = []
----> 3 for kvp in row_gen:
4 pass
.../site-packages/google/cloud/happybase/table.py in scan(self, row_start, row_stop, row_prefix, columns, timestamp, include_timestamp, limit, **kwargs)
391 while True:
392 try:
--> 393 partial_rows_data.consume_next()
394 for row_key in sorted(rows_dict):
395 curr_row_data = rows_dict.pop(row_key)
.../site-packages/google/cloud/bigtable/row_data.py in consume_next(self)
273 for chunk in response.chunks:
274
--> 275 self._validate_chunk(chunk)
276
277 if chunk.reset_row:
.../site-packages/google/cloud/bigtable/row_data.py in _validate_chunk(self, chunk)
388 self._validate_chunk_new_row(chunk)
389 if self.state == self.ROW_IN_PROGRESS:
--> 390 self._validate_chunk_row_in_progress(chunk)
391 if self.state == self.CELL_IN_PROGRESS:
392 self._validate_chunk_cell_in_progress(chunk)
.../site-packages/google/cloud/bigtable/row_data.py in _validate_chunk_row_in_progress(self, chunk)
368 self._validate_chunk_status(chunk)
369 if not chunk.HasField('commit_row') and not chunk.reset_row:
--> 370 _raise_if(not chunk.timestamp_micros or not chunk.value)
371 _raise_if(chunk.row_key and
372 chunk.row_key != self._row.row_key)
.../site-packages/google/cloud/bigtable/row_data.py in _raise_if(predicate, *args)
439 """Helper for validation methods."""
440 if predicate:
--> 441 raise InvalidChunk(*args)
InvalidChunk:
Can you show me how to scan BigTable from Python, ignoring / logging dirty rows that raise InvalidChunk? (try ... except won't work around the generator, which is in the Google Cloud API row_data PartialRowsData class.)
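The library does not offer this itself, but here is a hedged sketch of one way to do it: re-create the happybase scan generator after each InvalidChunk, resuming just past the last row key that was read successfully, and stop if a restart makes no progress so that a persistently bad row cannot cause an infinite loop. It assumes byte-string row keys:

import logging
from google.cloud.bigtable.row_data import InvalidChunk

def resilient_scan(table, limit=5000):
    """Yield (key, data) pairs, restarting the scan whenever InvalidChunk is raised."""
    row_start = None        # None means start at the beginning of the table
    remaining = limit
    while remaining > 0:
        made_progress = False
        try:
            for key, data in table.scan(row_start=row_start, limit=remaining):
                made_progress = True
                remaining -= 1
                row_start = key + b'\x00'   # smallest key strictly greater than `key`
                yield key, data
            return                          # generator exhausted cleanly
        except InvalidChunk:
            logging.warning('InvalidChunk while scanning at/after key %r', row_start)
            if not made_progress:
                # The next row keeps failing and its key is not exposed by the
                # exception, so stop instead of retrying it forever.
                return

Because the restart key is the last good key plus a zero byte, no row is yielded twice; the trade-off is that the scan stops at the first row it cannot get past rather than silently skipping it.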
Also, can you show me code to chunk-stream a table scan in BigTable? The HappyBase batch_size and scan_batching arguments don't seem to be supported.
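If batch_size and scan_batching are not available, paging can be emulated on top of scan() with row_start plus limit. A minimal sketch, again assuming byte-string row keys:

def scan_in_batches(table, batch_size=1000, columns=None):
    """Yield rows one page (at most batch_size rows) per scan() call."""
    row_start = None
    while True:
        count = 0
        for key, data in table.scan(row_start=row_start, limit=batch_size,
                                    columns=columns):
            count += 1
            row_start = key + b'\x00'   # next page starts just past this key
            yield key, data
        if count < batch_size:
            break                       # a short (or empty) page ends the scan

# usage:
# for key, data in scan_in_batches(table, batch_size=500):
#     ...

Each page is a fresh call, so memory held at any time is bounded by batch_size; the chunk-level gRPC streaming inside each call is still handled by the client library.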
This was likely due to this bug: https://github.com/googleapis/google-cloud-python/issues/2980
The bug has been fixed, so this should no longer be an issue.