
Cassandra buffered read of millions of columns

I've got a Cassandra cluster with a small number of rows (< 100). Each row has about 2 million columns. I need to get a full row (all 2 million columns), but things start failing all over the place before I can finish my read. I'd like to do some kind of buffered read.

Ideally I'd like to do something like this using Pycassa (no, this isn't the proper way to call get; it's just so you can get the idea):

results = {}
start = 0
while True:
    # Fetch blocks of size 500
    buffer = column_family.get(key, column_offset=start, column_count=500)
    if len(buffer) == 0:
        break

    # Merge these results into the main one
    results.update(buffer)

    # Update the offset
    start += len(buffer)

Pycassa (and by extension Cassandra) doesn't let you do this. Instead you need to specify a column name for column_start and column_finish. This is a problem since I don't actually know what the start or end column names will be. The special value "" can indicate the start or end of the row, but that doesn't work for any of the values in the middle.

So how can I accomplish a buffered read of all the columns in a single row? Thanks.

Chris Eberle asked Nov 21 '25 16:11


2 Answers

From the pycassa 1.0.8 documentation, it would appear that you could use something like the following [pseudocode]:

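results = {}
startColumn = ""
while True:
    # Fetch blocks of up to 100 columns, beginning at startColumn
    buffer = get(key, column_start=startColumn, column_finish="", column_count=100)
    # iterate over the returned values and merge them into results.
    # set startColumn = name of the last column returned.
    # stop once fewer than 100 columns come back.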

Remember that each subsequent call gives you only 99 new results, because it also returns startColumn, which you've already seen. I'm not skilled enough in Python yet to iterate on buffer to extract the column names.
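The paging idea above can be sketched in runnable form. This is only an illustration under stated assumptions: `FakeColumnFamily` is a hypothetical in-memory stand-in (not part of pycassa) that mimics how `get` slices by column name, and `read_full_row` is a made-up helper name. It relies on `get` returning an ordered mapping, so the last key of each page can serve as the next `column_start`.

```python
from collections import OrderedDict


class FakeColumnFamily:
    """Hypothetical in-memory stand-in for a pycassa ColumnFamily (illustration only)."""

    def __init__(self, rows):
        # rows maps row key -> dict of column name -> value
        self.rows = rows

    def get(self, key, column_start="", column_finish="", column_count=100):
        # Return up to column_count columns, sorted by name, with
        # column_start <= name <= column_finish ("" means unbounded).
        selected = OrderedDict()
        for name in sorted(self.rows[key]):
            if column_start != "" and name < column_start:
                continue
            if column_finish != "" and name > column_finish:
                break
            if len(selected) == column_count:
                break
            selected[name] = self.rows[key][name]
        return selected


def read_full_row(column_family, key, page_size=100):
    """Read every column of one row, page_size columns per request."""
    if page_size < 2:
        # Each page overlaps the previous one by one column,
        # so a page of size 1 would never make progress.
        raise ValueError("page_size must be at least 2")
    results = OrderedDict()
    start_column = ""
    while True:
        page = column_family.get(key, column_start=start_column,
                                 column_finish="", column_count=page_size)
        # After the first page, the first column repeats the previous
        # page's last column; update() simply overwrites it.
        results.update(page)
        if len(page) < page_size:
            break  # a short page means the end of the row was reached
        start_column = next(reversed(page))  # last column name in this page
    return results


cf = FakeColumnFamily({"row1": {"col%04d" % i: i for i in range(1050)}})
row = read_full_row(cf, "row1", page_size=100)
print(len(row))
```

With a real pycassa column family in place of the stand-in, the same loop should apply, assuming its `get` accepts these keyword arguments and returns columns in comparator order.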

Chris K answered Nov 24 '25 06:11


In v1.7.1+ of pycassa you can use xget, which pages through the columns for you, to read a row as wide as 2**63-1 columns.

for name, value in cf.xget(key, column_count=2**63-1):
    # xget yields (column name, value) tuples; do something with each one.
    print(name, value)
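If the columns should be handled in blocks rather than one at a time, a generator like this can be wrapped in a small batching helper. A minimal sketch: `in_batches` and `fake_xget` are hypothetical names, and `fake_xget` only imitates the (name, value) tuples that xget yields so the example runs without a cluster.

```python
from itertools import islice


def in_batches(columns, batch_size=500):
    """Yield lists of up to batch_size items from any iterable of columns."""
    iterator = iter(columns)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return  # the underlying iterator is exhausted
        yield batch


def fake_xget(total):
    # Hypothetical stand-in for cf.xget(key): lazily yields (name, value) tuples.
    for i in range(total):
        yield ("col%07d" % i, i)


batch_sizes = [len(batch) for batch in in_batches(fake_xget(1200), batch_size=500)]
print(batch_sizes)  # prints [500, 500, 200]
```

Only one batch is held in memory at a time, so the full 2 million columns never need to sit in a single dict. Passing `cf.xget(key)` instead of `fake_xget(...)` should work the same way, assuming xget behaves as described above.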
user1987428 answered Nov 24 '25 06:11


