Cassandra and Tombstones: Creating a Row, Deleting the Row, Recreating the Row = Performance?

Tags: performance, cassandra, tombstone

Could someone please explain what effect the following process has on tombstones:

1.) Creating a "Row" with Key "1" ("Fields": user, password, date)

2.) Deleting the "Row" with Key "1"

3.) Creating a "Row" with Key "1" ("Fields": user, password, logincount)

The sequence is executed sequentially in a single thread, so the steps follow each other quickly, with no long pauses between them.

My questions:

1.) What effect does this have on tombstone creation? After step 2.) a tombstone exists. But what happens to that tombstone when the (slightly changed) row is created again under the same key in step 3.)? Can Cassandra "reanimate" a tombstoned row efficiently?

2.) How much worse is the process described above than deleting only the "date" field and then creating the "logincount" field? (The targeted approach will most likely perform better. On the other hand, it is much more complex to work out which fields have been deleted than to simply delete the whole row and recreate it from scratch with the correct data...)

Remark/Update:

What I actually want to do is set the "date" field to null. But this does not work in Cassandra: nulls are not allowed as values. So if I want to set a field to null, I have to delete it. But I am afraid that this explicit second delete request will have a negative performance impact (compared to just setting the field to null)... And as described, I first have to find out which fields have been nullified and previously had a value (I have to compare all attributes to determine this).
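To make the "compare all attributes" step concrete, here is a minimal sketch of the comparison, assuming the old and new versions of the row are available client-side as plain dictionaries. The helper name `columns_to_delete` and the sample values are hypothetical and not part of any Cassandra client API.

```python
from typing import Dict, List, Optional


def columns_to_delete(old: Dict[str, Optional[str]],
                      new: Dict[str, Optional[str]]) -> List[str]:
    """Return the fields that had a value in `old` but are missing or None in `new`."""
    return sorted(
        name
        for name, value in old.items()
        if value is not None and new.get(name) is None
    )


# Hypothetical sample data mirroring steps 1.) and 3.) of the question.
old_row = {"user": "markus", "password": "secret", "date": "2011-09-03"}
new_row = {"user": "markus", "password": "secret", "logincount": "1"}

print(columns_to_delete(old_row, new_row))  # ['date']
```

Only the fields returned here would need an explicit delete request; all other fields can simply be written with their new values.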

Thank you very much! Markus

asked Sep 03 '11 15:09

Markus

2 Answers

I would like to belatedly clarify some things here.

First, with respect to Theodore's answer:

1) All rows have a tombstone field internally for simplicity, so when the new row is merged with the tombstone, it just becomes "row with new data, that also remembers that it was once deleted at time X." So there is no real penalty in that respect.

2) It is incorrect to say that "If you create and delete a column value rapidly enough that no flush takes place in the middle... the tombstone [is] simply discarded"; tombstones are always persisted, for correctness. Perhaps the situation Theodore was thinking of was the other way around: if you delete, then insert a new column value, then the new column replaces the tombstone (just as it would any obsolete value). This is different from the row case since the Column is the "atom" of storage.

3) Given (2), the delete-row-and-insert-new-one is likely to be more performant if there are many columns to be deleted over time. But for a single column the difference is negligible.

Finally, regarding Tyler's answer, in my opinion it is more idiomatic to simply delete the column in question than to change its value to an empty [byte]string.
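Points 1) and 2) above can be illustrated with a toy in-memory model. This is only a sketch under simplifying assumptions, not Cassandra's actual storage code: the `Row` class, its method names, and the integer timestamps are all hypothetical, and timestamp ties simply keep the earlier write.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class Row:
    # Timestamp of the latest row-level delete; None means "never deleted".
    deleted_at: Optional[int] = None
    # Column name -> (timestamp, value); a value of None is a column tombstone.
    columns: Dict[str, Tuple[int, Optional[str]]] = field(default_factory=dict)

    def write(self, name: str, value: Optional[str], ts: int) -> None:
        # The column is the "atom": a newer cell replaces the older one outright.
        current = self.columns.get(name)
        if current is None or ts > current[0]:
            self.columns[name] = (ts, value)

    def delete_column(self, name: str, ts: int) -> None:
        self.write(name, None, ts)

    def delete_row(self, ts: int) -> None:
        # The row only remembers *when* it was deleted.
        if self.deleted_at is None or ts > self.deleted_at:
            self.deleted_at = ts

    def read(self) -> Dict[str, str]:
        live = {}
        for name, (ts, value) in self.columns.items():
            if value is None:
                continue  # column tombstone
            if self.deleted_at is not None and ts <= self.deleted_at:
                continue  # shadowed by the row-level delete
            live[name] = value
        return live


# Row case: create, delete, recreate under the same key.
row = Row()
for name, value in [("user", "markus"), ("password", "secret"), ("date", "2011-09-03")]:
    row.write(name, value, ts=1)
row.delete_row(ts=2)
for name, value in [("user", "markus"), ("password", "secret"), ("logincount", "1")]:
    row.write(name, value, ts=3)
print(row.read())      # new data only; "date" stays deleted
print(row.deleted_at)  # the row still remembers the delete at time 2

# Column case: a newer value replaces the column tombstone.
other = Row()
other.write("date", "2011-09-03", ts=1)
other.delete_column("date", ts=2)
other.write("date", "2011-09-04", ts=3)
print(other.columns["date"])
```

In this model the recreated row is just "row with new data that also remembers it was deleted at time 2", while in the column case the tombstone is overwritten like any other obsolete value.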

answered Sep 30 '22 18:09

jbellis

1). If you delete the whole row, then the tombstone is still kept and not reanimated by the subsequent insertion in step 3. This is because there may have been an insertion for the row a long time ago (e.g. step 0: key "1", field "name"). Row "1", field "name" needs to stay deleted, while row "1", field "user" is reanimated.

2). If you create and delete a column value rapidly enough that no flush takes place in the middle, there is no performance impact. The column will be updated in-place in the Memtable, and the tombstone simply discarded. Only a single value will end up being written persistently to an SSTable.

However, if the Memtable is flushed to disk between steps 2 and 3, then the tombstone will be written to the resulting SSTable. A subsequent flush will write the new value to the next SSTable. This will make subsequent reads slower, since the column now needs to be read from both SSTables and reconciled. (Similarly if a flush occurs between steps 1 and 2.)
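The read-time reconciliation in the flushed case can be sketched as follows. This is a toy illustration, not Cassandra's read path: each SSTable is modelled as a plain dictionary from column name to a `(timestamp, value)` pair, `None` stands for a tombstone, and the `reconcile` helper is hypothetical. It covers only the case where a flush happens between steps 2 and 3.

```python
from typing import Dict, Optional, Tuple

Cell = Tuple[int, Optional[str]]  # (timestamp, value); None value = tombstone


def reconcile(*sstables: Dict[str, Cell]) -> Dict[str, Cell]:
    """Merge cells from several tables, keeping the newest cell per column."""
    merged: Dict[str, Cell] = {}
    for table in sstables:
        for name, (ts, value) in table.items():
            if name not in merged or ts > merged[name][0]:
                merged[name] = (ts, value)
    return merged


# A flush after step 2 persists the tombstone in the first SSTable ...
sstable_1 = {"user": (2, None)}
# ... and a later flush persists the recreated value in the second one.
sstable_2 = {"user": (3, "markus")}

# A read has to consult both tables to find the winning cell.
print(reconcile(sstable_1, sstable_2))
```

The result does not depend on the order in which the tables are visited, but every table that may hold the column has to be read, which is where the extra read cost comes from.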

answered Sep 30 '22 16:09

Theodore Hong
