I am trying to scan BigTable data where some rows are 'dirty' - but depending on the scan this fails, causing (serialization?) InvalidChunk exceptions. The code is as follows:
from google.cloud import bigtable
from google.cloud import happybase
client = bigtable.Client(project=project_id, admin=True)
instance = client.instance(instance_id)
connection = happybase.Connection(instance=instance)
table = connection.table(table_name)
for key, row in table.scan(limit=5000): #BOOM!
pass
Leaving out some columns, limiting the scan to fewer rows, or specifying the start and stop keys allows the scan to succeed. I cannot tell from the stack trace which values are problematic - it varies across columns - the scan just fails. This makes it difficult to clean the data at the source.
When I use the Python debugger, I see that the chunk (which is of type google.bigtable.v2.bigtable_pb2.CellChunk) has no value (it is NULL / undefined):
ipdb> pp chunk.value
b''
ipdb> chunk.value_size
0
I can confirm this with the HBase shell, using the row key I got from self._row.row_key.
So the question becomes: how can a BigTable scan filter out columns that have an undefined / empty / null value?
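One possible server-side workaround (a hedged sketch, not part of the happybase API) is to scan with the native google.cloud.bigtable client and attach a value filter, so that empty cells never reach the client. It assumes the same project_id / instance_id / table_name as above, and uses the RE2 escape \C ("any byte") so that a cell must contain at least one byte to be returned:

from google.cloud import bigtable
from google.cloud.bigtable import row_filters

client = bigtable.Client(project=project_id, admin=True)
instance = client.instance(instance_id)
table = instance.table(table_name)

# ValueRegexFilter applies RE2 to the raw cell bytes; '\C+' means
# "one or more bytes of any kind", so cells whose value is b'' are dropped.
non_empty = row_filters.ValueRegexFilter(b'\\C+')

partial_rows = table.read_rows(limit=5000, filter_=non_empty)
partial_rows.consume_all()  # newer client versions let you iterate read_rows() directly
for row_key, row_data in partial_rows.rows.items():
    pass  # row_data.cells maps column family -> qualifier -> [Cell, ...]

Note that this drops individual empty cells rather than whole rows; skipping every row that contains an empty cell would need a conditional row filter, which is beyond this sketch.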
I get the same problem from both Google Cloud Python APIs (google.cloud.bigtable and google.cloud.happybase), which return generators that internally stream data as chunks over gRPC.
The abbreviated stack trace is as follows:
---------------------------------------------------------------------------
InvalidChunk Traceback (most recent call last)
<ipython-input-48-922c8127f43b> in <module>()
1 row_gen = table.scan(limit=n)
2 rows = []
----> 3 for kvp in row_gen:
4 pass
.../site-packages/google/cloud/happybase/table.py in scan(self, row_start, row_stop, row_prefix, columns, timestamp, include_timestamp, limit, **kwargs)
391 while True:
392 try:
--> 393 partial_rows_data.consume_next()
394 for row_key in sorted(rows_dict):
395 curr_row_data = rows_dict.pop(row_key)
.../site-packages/google/cloud/bigtable/row_data.py in consume_next(self)
273 for chunk in response.chunks:
274
--> 275 self._validate_chunk(chunk)
276
277 if chunk.reset_row:
.../site-packages/google/cloud/bigtable/row_data.py in _validate_chunk(self, chunk)
388 self._validate_chunk_new_row(chunk)
389 if self.state == self.ROW_IN_PROGRESS:
--> 390 self._validate_chunk_row_in_progress(chunk)
391 if self.state == self.CELL_IN_PROGRESS:
392 self._validate_chunk_cell_in_progress(chunk)
.../site-packages/google/cloud/bigtable/row_data.py in _validate_chunk_row_in_progress(self, chunk)
368 self._validate_chunk_status(chunk)
369 if not chunk.HasField('commit_row') and not chunk.reset_row:
--> 370 _raise_if(not chunk.timestamp_micros or not chunk.value)
371 _raise_if(chunk.row_key and
372 chunk.row_key != self._row.row_key)
.../site-packages/google/cloud/bigtable/row_data.py in _raise_if(predicate, *args)
439 """Helper for validation methods."""
440 if predicate:
--> 441 raise InvalidChunk(*args)
InvalidChunk:
Can you show me how to scan BigTable from Python, ignoring / logging dirty rows that raise InvalidChunk? (try ... except won't work around the generator, which is in the Google Cloud API row_data PartialRowsData class.)
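The library does not offer this itself, but here is a hedged sketch of one way to do it: re-create the happybase scan generator after each InvalidChunk, resuming just past the last row key that was read successfully, and stop if a restart makes no progress so that a persistently bad row cannot cause an infinite loop. It assumes byte-string row keys:

import logging
from google.cloud.bigtable.row_data import InvalidChunk

def resilient_scan(table, limit=5000):
    """Yield (key, data) pairs, restarting the scan whenever InvalidChunk is raised."""
    row_start = None        # None means start at the beginning of the table
    remaining = limit
    while remaining > 0:
        made_progress = False
        try:
            for key, data in table.scan(row_start=row_start, limit=remaining):
                made_progress = True
                remaining -= 1
                row_start = key + b'\x00'   # smallest key strictly greater than `key`
                yield key, data
            return                          # generator exhausted cleanly
        except InvalidChunk:
            logging.warning('InvalidChunk while scanning at/after key %r', row_start)
            if not made_progress:
                # The next row keeps failing and its key is not exposed by the
                # exception, so stop instead of retrying it forever.
                return

Because the restart key is the last good key plus a zero byte, no row is yielded twice; the trade-off is that the scan stops at the first row it cannot get past rather than silently skipping it.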
Also, can you show me code to chunk-stream a table scan in BigTable? The HappyBase batch_size and scan_batching arguments don't seem to be supported.
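If batch_size and scan_batching are not available, paging can be emulated on top of scan() with row_start plus limit. A minimal sketch, again assuming byte-string row keys:

def scan_in_batches(table, batch_size=1000, columns=None):
    """Yield rows one page (at most batch_size rows) per scan() call."""
    row_start = None
    while True:
        count = 0
        for key, data in table.scan(row_start=row_start, limit=batch_size,
                                    columns=columns):
            count += 1
            row_start = key + b'\x00'   # next page starts just past this key
            yield key, data
        if count < batch_size:
            break                       # a short (or empty) page ends the scan

# usage:
# for key, data in scan_in_batches(table, batch_size=500):
#     ...

Each page is a fresh call, so memory held at any time is bounded by batch_size; the chunk-level gRPC streaming inside each call is still handled by the client library.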
This was likely due to this bug: https://github.com/googleapis/google-cloud-python/issues/2980
The bug has been fixed, so this should no longer be an issue.