Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why do we need a separate datastructure like B-Tree for database and file system?

Tags:

I am reading up on B Trees and looks like they achieve the dynamic set operations in O(lg n) time . The Red Black Tree( TreeMap in java ) also achieves the same operation in asymptotically the same time frame . So I would like to know what makes B trees more useful for database and file systems

like image 450
Geek Avatar asked Oct 10 '12 12:10

Geek


People also ask

Why are B-trees used in databases?

A B-tree is a tree data structure that keeps data sorted and allows searches, insertions, and deletions in logarithmic amortized time. Unlike self-balancing binary search trees, it is optimized for systems that read and write large blocks of data. It is most commonly used in database and file systems.

Why do database systems use a B+ tree instead of a B-tree?

The principal advantage of B+ trees over B trees is they allow you to pack in more pointers to other nodes by removing pointers to data, thus increasing the fanout and potentially decreasing the depth of the tree. The disadvantage is that there are no early outs when you might have found a match in an internal node.

How is B-tree used in file system?

In computer science, a B-tree is a self-balancing tree data structure that maintains sorted data and allows searches, sequential access, insertions, and deletions in logarithmic time. The B-tree generalizes the binary search tree, allowing for nodes with more than two children.

Why is a B-tree a better structure than a binary search tree for implementation of an index?

B-Tree : B-Tree is known as a self-balancing tree as its nodes are sorted in the inorder traversal. Unlike the binary trees, in B-tree, a node can have more than two children. B-tree has a height of logM N (Where 'M' is the order of tree and N is the number of nodes).


2 Answers

The main reason for the existence of B-Trees is to better utilize the behaviour of devices that read and write large chunks of data. Two properties are important to make the B-Tree better than binary trees when data has to be stored on disk:

  • Access to disk is really slow (compared to memory or caches, random access to data on disk is orders of magnitude slower); and
  • Every single read causes a whole sector to be loaded from the drive - assuming a sector size of 4K, this means 1000 integers, or tens of some larger objects you're storing.

Hence, we can use the pros of the second fact, while also minimizing the cons - i.e. number of disk accesses.

So, instead of just storing a single number in every node that tells us if we should continue to the left or to the right, we can create a bigger index that tells us if we should continue to the first 1/100, to the second or to the 99-th (imagine books in a library sorted by their first letter, then by the second, and so on). As long as all this data fits on a single sector, it will be loaded anyway, so we might as well use it completely.

This results in roughly logb N lookups, where N is the number of records. This number, while asymptotically the same as log2 N, is actually a few times smaller with large enough N and b - and since we're talking about storing data to disk for use in databases, etc., the amount of data is usually large enough to justify this.

The rest of the design decision are mainly done in order to make this one work efficiently, as modifying a N-ary tree is trickier than a binary one.

like image 94
Ivan Vergiliev Avatar answered Oct 24 '22 17:10

Ivan Vergiliev


RB trees are binary search trees. B trees can have more than two child nodes. In fact, the number of child nodes is variable.

So, you can vary the number of child nodes such that the size of a node is always a multiple of the filesystem block size. This reduces waste when reading: you cannot read less than one block anyway, you always have to read the full block, so you might just as well fill it up with useful data. Increasing the number of child nodes will also decrease the depth of the tree, thus decreasing the average number of "hops" (i.e. disk reads), which again increases performance.

Remember: B trees are usually used to store data structures which are orders of magnitude larger than memory, whereas RB trees are typically used to store data structures which are orders of magnitude smaller than memory. In fact, B trees are specifically designed as an on-disk data structure as opposed to an in-memory data structure.

This is the key sentence from the Wikipedia article (emphasis mine):

the B-tree is optimized for systems that read and write large blocks of data

like image 31
Jörg W Mittag Avatar answered Oct 24 '22 17:10

Jörg W Mittag