I need to do a batch INSERT in Cassandra using Python. I am using the latest DataStax Python driver. The INSERTs are batches of columns that will be in the same row. I will have many rows to insert, but chunks of the data will be in the same row.
I can do individual INSERTs in a for loop, as described in this post: Parameterized queries with the Python Cassandra Module. I am using a parameterized query with values, as shown in that example.
This did not help: How to multi insert rows in cassandra
I am not clear on how to assemble a parameterized batch of INSERTs, something like:

BEGIN BATCH
INSERT (query values1)
INSERT (query values2)
...
APPLY BATCH;

cursor.execute(batch_query)

Is this even possible? Will it speed up my INSERTs? I have to do millions; even thousands take too long.
I found some Java info:
http://www.datastax.com/dev/blog/client-side-improvements-in-cassandra-2-0
Since version 2.0.0 of the driver, there is a BatchStatement construct. If using the ORM, you can also use the BatchQuery class.
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import BatchStatement

cluster = Cluster([server_ip])
session = cluster.connect(keyspace)
insert_user = session.prepare('INSERT INTO table_name (id, name) VALUES (?, ?)')

batch = BatchStatement(consistency_level=ConsistencyLevel.QUORUM)
for i, j in some_value:
    try:
        batch.add(insert_user, (i, j))
        logger.info('Data added to the batch')
    except Exception as e:
        logger.error('The cassandra error: {}'.format(e))
session.execute(batch)
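Note that with millions of rows you cannot keep adding to a single BatchStatement: Cassandra rejects batches over its size threshold. A common pattern is to split the input into fixed-size chunks and execute one batch per chunk. This is a sketch, not part of the answer above; the chunked helper and the chunk size of 100 are my own choices:

```python
def chunked(iterable, n):
    """Yield successive lists of n items; the last list may be shorter."""
    chunk = []
    for item in iterable:
        chunk.append(item)
        if len(chunk) == n:
            yield chunk
            chunk = []
    if chunk:
        yield chunk

# Usage with the driver (sketch, reusing the names from the code above):
# for rows in chunked(some_value, 100):
#     batch = BatchStatement(consistency_level=ConsistencyLevel.QUORUM)
#     for i, j in rows:
#         batch.add(insert_user, (i, j))
#     session.execute(batch)
```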
Right now the DataStax Python driver doesn't support the CQL protocol in Cassandra 2.0 -- it's a work in progress and betas should show up soon. At that point you'll be able to have a BATCH statement to which you can add bound prepared statements as needed.

Considering the above, the solution you could use is the one described in the post you've linked: prepare a statement that includes a BATCH with a series of INSERTs. The obvious downside of this solution is that you'd need to decide upfront how many inserts will be in your batch, and you'll also have to split your input data accordingly.
Example code:
BATCH_SIZE = 10
INSERT_STMT = 'INSERT INTO T (id, fld1) VALUES (?, ?) '  # trailing space keeps the CQL valid

BATCH_STMT = 'BEGIN BATCH '
for i in range(BATCH_SIZE):
    BATCH_STMT += INSERT_STMT
BATCH_STMT += 'APPLY BATCH;'
prep_batch = session.prepare(BATCH_STMT)
Then as you receive data, iterate over it and, for every BATCH_SIZE rows, bind their values to prep_batch and execute it.
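That bind-and-execute step can be sketched as follows; flatten_batches is a hypothetical helper name, and rows are assumed to be (id, fld1) tuples matching the INSERT above. The key point is that the parameters of all BATCH_SIZE inserts are flattened into one bind list for the single prepared statement:

```python
def flatten_batches(rows, batch_size):
    """Group rows into full batches of batch_size and flatten each
    batch's values into one parameter list for the prepared BATCH.
    Returns (bind_lists, leftover_rows)."""
    bind_lists, batch = [], []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            bind_lists.append([v for r in batch for v in r])
            batch = []
    return bind_lists, batch

# Usage (sketch): each bind list goes to the prepared batch statement.
# for params in flatten_batches(rows, BATCH_SIZE)[0]:
#     session.execute(prep_batch.bind(params))
# Leftover rows (fewer than BATCH_SIZE) need a smaller prepared batch
# or individual INSERTs.
```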