
Why does CouchDB use an append-only B+ tree and not a HAMT

I'm reading up on data structures, especially immutable ones like the append-only B+ tree used in CouchDB and the hash array mapped trie (HAMT) used in Clojure and some other functional programming languages.

The main reason data structures that work well in memory might not work well on disk appears to be the time spent on disk seeks due to fragmentation, as with a normal binary tree.

However, a HAMT is also very shallow, so it doesn't require any more seeks than a B-tree.
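To make the shallowness claim concrete, here is a back-of-the-envelope sketch comparing levels needed to hold a billion keys; the fanouts (32 for a HAMT, 1024 for a disk B-tree) are the illustrative figures from this question, not measurements of any specific implementation:

```python
import math

def hamt_depth(n, fanout=32):
    """Levels needed for a HAMT of the given fanout to hold n keys."""
    return max(1, math.ceil(math.log(n, fanout)))

def btree_depth(n, fanout=1024):
    """Levels needed for a B-tree whose nodes hold ~fanout children."""
    return max(1, math.ceil(math.log(n, fanout)))

n = 1_000_000_000  # a billion keys
print(hamt_depth(n))   # 6 levels at fanout 32
print(btree_depth(n))  # 3 levels at fanout 1024
```

Both are shallow, but each level is potentially a disk seek, so the B-tree's larger fanout still halves the worst-case seek count here.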

Another suggested reason is that deletions from an array mapped trie are more expensive than from a B-tree. This is based on the assumption that we're talking about a dense vector, and doesn't apply when using either as a hash map.

What's more, it seems that a B tree does more rebalancing, so using it in an append-only manner produces more garbage.

So why do CouchDB and practically every other database and filesystem use B trees?

[edit] fractal trees? log-structured merge tree? mind = blown

[edit] Real-life B-trees use a degree in the thousands, while a HAMT has a degree of 32. A HAMT of degree 1024 would be possible, but slower, since popcnt only handles 32 or 64 bits at a time.
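The popcnt remark refers to how a HAMT node finds a child: each node keeps a bitmap of occupied slots, and a popcount of the bits below the target index gives the child's position in a compressed array. A minimal sketch of that lookup (the bitmap value is an arbitrary example):

```python
def hamt_slot(bitmap: int, index: int):
    """Map a 5-bit hash chunk (index 0..31) to a position in the node's
    compressed child array, or None if that child is absent."""
    bit = 1 << index
    if not (bitmap & bit):
        return None
    # popcount of the bits below `index` gives the array position
    return bin(bitmap & (bit - 1)).count("1")

bitmap = 0b1011010  # children present at indices 1, 3, 4, 6
print(hamt_slot(bitmap, 4))  # 2 (two set bits below index 4)
print(hamt_slot(bitmap, 2))  # None (no child at index 2)
```

At degree 1024 the bitmap would span 16 machine words, so each lookup would need up to 16 popcnt instructions instead of one, which is the slowdown the edit alludes to.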

Pepijn asked Dec 28 '13 16:12
1 Answer

B-trees are used because they are a well-understood algorithm that achieves "ideal" sorted-order read cost. Because keys are kept in sorted order, moving to the next or previous key is very cheap.

HAMTs, and hash storage in general, store keys in effectively random order. Keys are retrieved by their exact value, and there is no efficient way to find the next or previous key.
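A small sketch of the distinction, using a plain sorted list to stand in for a B-tree page and a set for hash storage (the keys are arbitrary examples):

```python
import bisect

sorted_keys = ["apple", "banana", "cherry", "grape", "peach"]

def next_key(keys, k):
    """Successor of k in a sorted key sequence -- an O(log n) step,
    which B-trees support natively via in-page order and sibling links."""
    i = bisect.bisect_right(keys, k)
    return keys[i] if i < len(keys) else None

print(next_key(sorted_keys, "banana"))  # 'cherry'

# Hash storage offers no such operation: finding the successor of
# "banana" requires examining every key.
hashed = set(sorted_keys)
successor = min((k for k in hashed if k > "banana"), default=None)  # O(n)
print(successor)  # 'cherry', but only after a full scan
```

This is why range scans (`WHERE k BETWEEN a AND b`, ordered views) favor sorted structures regardless of point-lookup speed.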

Regarding degree, it is normally chosen indirectly, by choosing a page size. HAMTs are most often used in memory, with nodes sized for cache lines, while B-trees are most often used with secondary storage, where page sizes are set by IO and VM parameters.
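A rough sketch of how page size determines degree; the key and pointer sizes are illustrative assumptions, not the layout of any particular engine:

```python
def fanout(page_size, key_size=16, pointer_size=8):
    """Approximate children per node when one node must fit one page,
    ignoring per-page headers. Sizes here are illustrative only."""
    return page_size // (key_size + pointer_size)

print(fanout(4096))      # ~170 children for a 4 KiB disk page
print(fanout(64, 4, 4))  # ~8 children for a cache-line-sized node
```

The same formula explains both regimes: cache-line-sized nodes give HAMT-like small degrees, disk-page-sized nodes give B-tree-like large ones.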

Log-Structured Merge (LSM) is a different approach to sorted-order storage which achieves better write efficiency by trading off some read efficiency. That hit to read efficiency can be a problem for read-modify-write workloads, but the fewer uncached reads there are, the better LSM's overall throughput relative to a B-tree - at the cost of higher worst-case read latency.

LSM also offers the promise of a wider performance envelope. Putting new data into its proper place is "deferred", which makes it possible to tune read-to-write efficiency by controlling the proportion of deferred cleanup work to live work. In theory, an ideal LSM with zero deferral is a B-tree, and one with 100% deferral is a log.

However, LSM is more a family of algorithms than a specific algorithm like the B-tree. Its use is growing in popularity, but adoption is hindered by the lack of a de facto optimal LSM design. LevelDB/RocksDB is one of the more practical LSM implementations, but it is far from optimal.

Another approach to achieving write-throughput efficiency is to write-optimize B-trees through write deferral, while attempting to preserve their optimal read throughput.

Fractal trees, shuttle trees, and stratified trees are designs of this type, and represent a hybrid gray area between B-tree and LSM. Rather than deferring writes to an offline process, they amortize write deferral in a fixed way. For example, such a design might maintain a fixed 60% write-deferral fraction. This means they can't achieve the 100% write-deferral performance of an LSM, but they have more predictable read performance, making them more practical drop-in replacements for B-trees. (As in the commercial Tokutek MySQL and MongoDB fractal-tree backends.)
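The core buffering idea behind these hybrids can be sketched with a two-level toy tree; the single pivot key "m" and the buffer size are arbitrary assumptions, and real fractal trees buffer messages at every internal node:

```python
class BufferedTree:
    """Two-level sketch of a write-buffered (fractal-style) tree:
    inserts accumulate in the root's buffer and are pushed to the
    correct child only when the buffer fills, so many writes share one
    child-page touch. Reads must consult the buffer too, which is the
    bounded read-cost overhead these designs accept."""

    def __init__(self, buffer_limit=4):
        self.buffer = []          # pending (key, value) messages
        self.limit = buffer_limit
        self.children = {"left": {}, "right": {}}  # stand-ins for child pages

    def put(self, key, value):
        self.buffer.append((key, value))
        if len(self.buffer) >= self.limit:
            self._flush()

    def _flush(self):
        # one batched descent applies many buffered writes at once
        for k, v in self.buffer:
            side = "left" if k < "m" else "right"
            self.children[side][k] = v
        self.buffer = []

    def get(self, key):
        for k, v in reversed(self.buffer):   # newest pending write wins
            if k == key:
                return v
        side = "left" if key < "m" else "right"
        return self.children[side].get(key)

t = BufferedTree()
for k in ["apple", "zebra", "cherry", "melon", "kiwi"]:
    t.put(k, k.upper())
print(t.get("zebra"))  # 'ZEBRA', already flushed to a child
print(t.get("kiwi"))   # 'KIWI', still pending in the buffer
```

Tuning the buffer size is the knob that moves such a design along the B-tree-to-LSM spectrum described above.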

David Jeske answered Oct 24 '22 02:10