Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why B-Tree for file systems?

I know this is a common question and I saw a few threads in Stack Overflow but still couldn't get it.

Here is an accepted answer from Stack overflow:

" Disk seeks are expensive. B-Tree structure is designed specifically to avoid disk seeks as much as possible. Therefore B-Tree packs much more keys/pointers into a single node than a binary tree. This property makes the tree very flat. Usually most B-Trees are only 3 or 4 levels deep and the root node can be easily cached. This requires only 2-3 seeks to find anything in the tree. Leaves are also "packed" this way, so iterating a tree (e.g. full scan or range scan) is very efficient, because you read hundreds/thousands data-rows per single block (seek).

In binary tree of the same capacity, you'd have several tens of levels and sequential visiting every single value would require at least one seek. "

I understand that B-Tree has more nodes (Order) than a BST. So it's definitely flat and shallow than a BST.

But these nodes are again stored as linked lists right?

I don't understand when they say that the keys are read as a block thereby minimising the no of I/Os.

Isn't the same argument hold good for BSTs too? Except that the links will be downwards?

Please someone explain it to me?

like image 718
user1189332 Avatar asked Dec 01 '22 14:12

user1189332


1 Answers

I understand that B-Tree has more nodes (Order) than a BST. So it's definitely flat and shallow than a BST. I don't understand when they say that the keys are read as a block thereby minimising the no of I/Os. Isn't the same argument hold good for BSTs too? Except that the links will be downwards?

Basically, the idea behind using a B+tree in file systems is to reduce the number of disk reads. Imagine that all the blocks in a drive are stored as a sequentially allocated array. In order to search for a specific block you would have to do a linear scan and it would take O(n) every time to find a block. Right?

Now, imagine that you got smart and decided to use a BST, great! You would store all your blocks in a BST an that would take roughly O(log(n)) to find a block. Remember that every branch is a disk access, which is highly expensive!

But, we can do better! The problem now is that a BST is really "tall". Because every node only has a fanout (number of children) factor of 2, if we had to store N objects, our tree would be in the order of log(N) tall. So we would have to perform at most log(N) access to find our leaves.

The idea behind the B+tree structure is to increase the fanout factor (number of children), reducing the height of tree and, thus, reducing the number of disk access that we have to make in order to find a leave. Remember that every branch is a disk access. For instance, if you pack X keys in a node of a B+tree every node will point to at most X+1 children.

Also, remember that a B+tree is structured in a way that only the leaves store the actual data. That way, you can pack more keys in the internal nodes in order to fill up one disk block, that, for instance, stores one node of a B+tree. The more keys you pack in a node the more children it will point to and the shorter your tree will be, thus reducing the number of disk access in order to find one leave.

But these nodes are again stored as linked lists right?

Also, in a B+tree structure, sometimes the leaves are stored in a linked list fashion. Remember that only the leaves store the actual data. That way, with the linked list idea, when you have to perform a sequential access after finding one block you would do it faster than having to traverse the tree again in order to find the next block, right? The problem is that you still have to find the first block! And for that, the B+tree is way better than the linked list.

Imagine that if all the accesses were sequential and started in the first block of the disk, an array would be better than the linked list, because in a linked list you still have to deal with the pointers. But, the majority of disk accesses, according to Tanenbaum, are not sequential and are accesses to files of small sizes (like 4KB or less). Imagine the time it would take if you had to traverse a linked list every time to access one block of 4KB...

This article explains it way better than me and uses pictures as well: https://loveforprogramming.quora.com/Memory-locality-the-magic-of-B-Trees

like image 183
wleao Avatar answered Dec 09 '22 14:12

wleao