
How to efficiently use batch writes to Cassandra using the DataStax Java driver?

I need to write to Cassandra in batches using the DataStax Java driver. This is my first time using batches with the driver, so I have some confusion.

Below is my code, in which I build a Statement object, add it to a Batch, and set the ConsistencyLevel to QUORUM as well.

import java.util.List;

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.exceptions.NoHostAvailableException;
import com.datastax.driver.core.exceptions.QueryExecutionException;
import com.datastax.driver.core.exceptions.QueryValidationException;
import com.datastax.driver.core.querybuilder.Batch;
import com.datastax.driver.core.querybuilder.Insert;
import com.datastax.driver.core.querybuilder.QueryBuilder;
import static com.datastax.driver.core.querybuilder.QueryBuilder.insertInto;

Session session = null;
Cluster cluster = null;

// we build the cluster and session objects here, and we use DowngradingConsistencyRetryPolicy as well:
// cluster = builder.withSocketOptions(socketOpts).withRetryPolicy(DowngradingConsistencyRetryPolicy.INSTANCE)

public void insertMetadata(List<AddressMetadata> listAddress) {
    // what is the purpose of unloggedBatch here?
    Batch batch = QueryBuilder.unloggedBatch();

    try {
        for (AddressMetadata data : listAddress) {
            // Insert (a RegularStatement) rather than Statement, so it can be added to the Batch
            Insert insert = insertInto("test_table").values(
                    new String[] { "address", "name", "last_modified_date", "client_id" },
                    new Object[] { data.getAddress(), data.getName(), data.getLastModifiedDate(), 1 });
            // is this the right way to set consistency level for Batch?
            insert.setConsistencyLevel(ConsistencyLevel.QUORUM);
            batch.add(insert);
        }

        // now execute the batch
        session.execute(batch);
    } catch (NoHostAvailableException e) {
        // log the exception
    } catch (QueryExecutionException e) {
        // log the exception
    } catch (QueryValidationException e) {
        // log the exception
    } catch (IllegalStateException e) {
        // log the exception
    } catch (Exception e) {
        // log the exception
    }
}

And below is my AddressMetadata class -

import java.util.Date;

public class AddressMetadata {

    private String name;
    private String address;
    private Date lastModifiedDate;

    public String getName() {
        return name;
    }

    public void setName(String name) {
        this.name = name;
    }

    public String getAddress() {
        return address;
    }

    public void setAddress(String address) {
        this.address = address;
    }

    public Date getLastModifiedDate() {
        return lastModifiedDate;
    }

    public void setLastModifiedDate(Date lastModifiedDate) {
        this.lastModifiedDate = lastModifiedDate;
    }
}

Now my question is: is the way I am using a batch to insert into Cassandra with the DataStax Java driver correct? And what about retry policies? If the batch statement execution fails, will it be retried?

And is there a better way of doing batch writes to Cassandra with the Java driver?

asked Oct 08 '14 by john

People also ask

Why use batch in Cassandra?

In Cassandra, BATCH is used to execute multiple modification statements (insert, update, delete) simultaneously. It is very useful when you have to update some columns and delete some existing data at the same time.

How does Cassandra batch work?

Batches are supported using CQL3 or modern Cassandra client APIs. In each case you'll be able to specify a list of statements you want to execute as part of the batch, a consistency level to be used for all statements and an optional timestamp. You'll be able to batch execute INSERT, DELETE and UPDATE statements.

What is batch statement in Cassandra?

The batch statement combines multiple data modification language statements (such as INSERT, UPDATE, and DELETE) to achieve atomicity and isolation when targeting a single partition or only atomicity when targeting multiple partitions.
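To make that concrete, here is a minimal sketch of a logged batch against the question's test_table, using the driver's core BatchStatement API. It assumes client_id is the table's partition key (the question does not show the schema), and it sets the consistency level once, on the batch itself:

import java.util.List;

import com.datastax.driver.core.BatchStatement;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;

public void insertForClient(Session session, int clientId, List<AddressMetadata> addresses) {
    PreparedStatement ps = session.prepare(
            "INSERT INTO test_table (address, name, last_modified_date, client_id) VALUES (?, ?, ?, ?)");

    // LOGGED batches go through the batch log and are atomic.
    BatchStatement batch = new BatchStatement(BatchStatement.Type.LOGGED);
    for (AddressMetadata a : addresses) {
        batch.add(ps.bind(a.getAddress(), a.getName(), a.getLastModifiedDate(), clientId));
    }
    // The consistency level is set once, on the batch itself.
    batch.setConsistencyLevel(ConsistencyLevel.QUORUM);
    session.execute(batch);
}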


2 Answers

First a bit of a rant:

The batch keyword in Cassandra is not a performance optimization for batching together large buckets of data for bulk loads.

Batches are used to group together atomic operations, actions that you expect to occur together. Batches guarantee that if a single part of your batch is successful, the entire batch is successful.

Using batches will probably not make your mass ingestion run faster.

Now for your questions:

what is the purpose of unloggedBatch here?

Cassandra uses a mechanism called batch logging in order to ensure a batch's atomicity. By specifying an unlogged batch, you turn off this functionality, so the batch is no longer atomic and may fail with partial completion. Naturally, there is a performance penalty for logging your batches and ensuring their atomicity; using unlogged batches removes this penalty.
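For reference, both variants are available from QueryBuilder; the only difference is whether the batch log is used (a minimal sketch):

import com.datastax.driver.core.querybuilder.Batch;
import com.datastax.driver.core.querybuilder.QueryBuilder;

// Logged: goes through the batch log, atomic, slower.
Batch logged = QueryBuilder.batch();

// Unlogged: skips the batch log, no atomicity guarantee, faster.
Batch unlogged = QueryBuilder.unloggedBatch();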

There are some cases in which you may want to use unlogged batches to ensure that requests (inserts) that belong to the same partition are sent together. If you batch operations together that need to be performed in different partitions / nodes, you are essentially creating more work for your coordinator. See specific examples of this in Ryan's blog:

Read this post
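To illustrate the point, here is a sketch of the asker's insert loop reworked to issue one unlogged batch per partition. It assumes address is the partition key of test_table, which the question does not confirm:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.querybuilder.Batch;
import com.datastax.driver.core.querybuilder.QueryBuilder;
import static com.datastax.driver.core.querybuilder.QueryBuilder.insertInto;

public void insertGroupedByPartition(Session session, List<AddressMetadata> listAddress) {
    // Group the rows by their (assumed) partition key.
    Map<String, List<AddressMetadata>> byPartition = new HashMap<String, List<AddressMetadata>>();
    for (AddressMetadata data : listAddress) {
        List<AddressMetadata> group = byPartition.get(data.getAddress());
        if (group == null) {
            group = new ArrayList<AddressMetadata>();
            byPartition.put(data.getAddress(), group);
        }
        group.add(data);
    }

    // One unlogged batch per partition: each batch lands on a single
    // replica set instead of making the coordinator fan out across nodes.
    for (List<AddressMetadata> group : byPartition.values()) {
        Batch batch = QueryBuilder.unloggedBatch();
        for (AddressMetadata data : group) {
            batch.add(insertInto("test_table").values(
                    new String[] { "address", "name", "last_modified_date", "client_id" },
                    new Object[] { data.getAddress(), data.getName(), data.getLastModifiedDate(), 1 }));
        }
        batch.setConsistencyLevel(ConsistencyLevel.QUORUM);
        session.execute(batch);
    }
}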

Now my question is: is the way I am using a batch to insert into Cassandra with the DataStax Java driver correct?

I don't see anything wrong with your code here; it just depends on what you're trying to achieve. Dig into the blog post I shared for more insight.

And what about retry policies? If the batch statement execution fails, will it be retried?

A batch will not retry on its own if it fails. The driver does have retry policies, but you have to apply those separately.

The default policy in the java driver only retries in these scenarios:

  • On a read timeout, if enough replicas replied but data was not retrieved.
  • On a write timeout, if we timeout while writing the distributed log used by batch statements.

Read more about the default policy and consider less conservative policies based on your use case.
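For example, a retry policy is configured once on the Cluster, not per batch; here is a minimal sketch (the contact point is a placeholder):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.policies.DefaultRetryPolicy;
import com.datastax.driver.core.policies.LoggingRetryPolicy;

// Retry policies apply cluster-wide, not per statement or batch.
Cluster cluster = Cluster.builder()
        .addContactPoint("127.0.0.1")  // placeholder contact point
        // LoggingRetryPolicy wraps another policy and logs each retry
        // decision, which helps when investigating failed batches.
        .withRetryPolicy(new LoggingRetryPolicy(DefaultRetryPolicy.INSTANCE))
        .build();

Swapping in DowngradingConsistencyRetryPolicy.INSTANCE, as the question's commented-out builder does, makes the driver retry at a lower consistency level instead of failing outright.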

answered Sep 18 '22 by phact


We debated for a while between using async requests and batches, and tried out both to compare. We got better throughput using "unlogged batches" than with individual "async" requests. We don't know why, but based on Ryan's blog, I am guessing it has to do with the write size: we are probably doing many small writes, so batching them gave us better performance, as it reduces network traffic.

I have to mention that we are not even doing "unlogged batches" in the recommended way. The recommended way is to batch by a single partition key, i.e. to batch all the records which belong to the same partition key. But we were just batching records which probably belong to different partitions.

Someone did some benchmarking to compare async requests and "unlogged batches", and we found it quite useful. Here is the link.
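For comparison, the individual-async approach looks roughly like this (a sketch; real code should also cap the number of in-flight requests):

import java.util.ArrayList;
import java.util.List;

import com.datastax.driver.core.ResultSetFuture;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.Statement;

public void insertAsync(Session session, List<Statement> inserts) {
    // Fire one non-blocking request per statement...
    List<ResultSetFuture> futures = new ArrayList<ResultSetFuture>();
    for (Statement stmt : inserts) {
        futures.add(session.executeAsync(stmt));
    }
    // ...then wait for all of them to complete.
    for (ResultSetFuture future : futures) {
        future.getUninterruptibly();
    }
}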

answered Sep 21 '22 by Chandra