How to lay out B-Tree data on disk?

Tags: database, b-tree, disk, b-tree-index

I know how a B-Tree works in memory; it's easy enough to implement. What is currently completely beyond me, however, is how to find a data layout that works effectively on disk, such that:

The number of entries in the B-Tree can grow indefinitely (or at least to more than 1000 GB)
Disk-level copying operations are minimized
The values can have arbitrary size (i.e. no fixed schema)

If anyone could provide insight into laying out B-Tree structures at the disk level, I'd be very grateful. The last bullet point in particular gives me a lot of headaches. I would also appreciate pointers to books; most database literature I've seen explains only the high-level structure (i.e. "this is how you do it in memory") and skips the nitty-gritty details of the disk layout.

276

asked Nov 22 '16 11:11

Alan47

1 Answer

UPDATE (archived version of the Oracle index internals article): http://web.archive.org/web/20161221112438/http://www.toadworld.com/platforms/oracle/w/wiki/11001.oracle-b-tree-index-from-the-concept-to-internals

OLD (the original link no longer exists): some information about Oracle index internals: http://www.toadworld.com/platforms/oracle/w/wiki/11001.oracle-b-tree-index-from-the-concept-to-internals

Notes:

Databases generally do not implement indexes as plain B-trees but as a variant called the B+ tree, which, according to Wikipedia:

A B+ tree can be viewed as a B-tree in which each node contains only keys (not key-value pairs), and to which an additional level is added at the bottom with linked leaves.

Databases generally work with block-oriented storage, and a B+ tree is better suited to this than a B-tree.

The blocks have a fixed size and are left with some free space to accommodate future changes in value or key size.

A block can be either a leaf (holds the actual data) or a branch (holds pointers to its child blocks, which are leaves or other branches).
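
As a rough illustration of such a block, here is a minimal sketch in Python. The header layout (`type`, entry count, bytes used) and the length-prefixed entries are assumptions made for this example, not the format of Oracle or any other database; the names `pack_block` and `unpack_block` are hypothetical.

```python
import struct

BLOCK_SIZE = 10_000                 # toy block size, matching the model in this answer
HEADER = struct.Struct("<BHH")      # assumed header: block type, entry count, bytes used
LEAF, BRANCH = 0, 1

def pack_block(block_type, entries):
    """Serialize (key, value) byte pairs into one fixed-size block."""
    body = bytearray()
    for key, value in entries:
        # Each entry is length-prefixed, so keys and values may vary in size.
        body += struct.pack("<HH", len(key), len(value)) + key + value
    if HEADER.size + len(body) > BLOCK_SIZE:
        raise ValueError("entries do not fit in one block; the block must be split")
    header = HEADER.pack(block_type, len(entries), len(body))
    # The zero padding is the free space left for later inserts or growing values.
    return (header + body).ljust(BLOCK_SIZE, b"\x00")

def unpack_block(block):
    """Return (block_type, entries) from a block produced by pack_block."""
    block_type, count, _used = HEADER.unpack_from(block, 0)
    offset = HEADER.size
    entries = []
    for _ in range(count):
        key_len, value_len = struct.unpack_from("<HH", block, offset)
        offset += 4
        key = bytes(block[offset:offset + key_len])
        offset += key_len
        value = bytes(block[offset:offset + value_len])
        offset += value_len
        entries.append((key, value))
    return block_type, entries
```

In this sketch a value larger than one block simply raises an error; how real engines store oversized values is outside what this toy format covers.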

A toy model of how writing to disk can be implemented (with a block size of 10 kB to simplify the arithmetic):

1. A 10 GB file is created on disk (it holds 1,000,000 blocks of 10 kB each).
2. The first block is assigned as the root and the next free one as a leaf, and a list of leaf addresses is stored in the root.
3. As new data is inserted, the current leaf node is filled with values until a threshold is reached.
4. As data continues to be inserted, the next free blocks are allocated as leaf blocks and the list of leaf nodes is updated.
5. After many inserts, the root node needs an intermediate level of children, so the next free block is allocated as a branch node. It takes over the list of leaves from the root, and the root now maintains only a list of intermediate nodes.
6. If a branch block needs to be split, the next free block is allocated as a branch node and added to the root's list, and the list of leaf nodes is split between the original and the new branch node.
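
The allocation part of this toy model can be sketched as follows. This is a minimal sketch under stated assumptions: the class name `BlockFile` and its methods are hypothetical, the "next free block" counter lives only in memory (a real system would persist it), and no splitting logic is shown.

```python
import os

BLOCK_SIZE = 10_000  # toy block size from the steps above

class BlockFile:
    """Fixed-size blocks in one preallocated file; block 0 is the root."""

    def __init__(self, path, total_blocks):
        self.total_blocks = total_blocks
        self.next_free = 1  # block 0 is reserved for the root
        # Preallocate the whole file up front (sparse on most filesystems).
        with open(path, "wb") as f:
            f.truncate(total_blocks * BLOCK_SIZE)
        self.f = open(path, "r+b")

    def allocate(self):
        """Hand out the next unused block number."""
        if self.next_free >= self.total_blocks:
            raise RuntimeError("file is full")
        block_no = self.next_free
        self.next_free += 1
        return block_no

    def write_block(self, block_no, payload):
        """Overwrite one block in place, padding it to the full block size."""
        if len(payload) > BLOCK_SIZE:
            raise ValueError("payload larger than one block")
        self.f.seek(block_no * BLOCK_SIZE)
        self.f.write(payload.ljust(BLOCK_SIZE, b"\x00"))

    def read_block(self, block_no):
        self.f.seek(block_no * BLOCK_SIZE)
        return self.f.read(BLOCK_SIZE)

    def close(self):
        self.f.close()
```

Because every block has the same size, a block is always rewritten at its own offset and nothing else in the file has to move, which keeps disk-level copying limited to the blocks involved in a split.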

Reading a value from a big index can then go as follows:

1. Read the first/root block (seek(0), read(10k)), which points to a child located in block 900.
2. Read block 900 (seek(900*10k), read(10k)), which points to a child located in block 5000.
3. Read block 5000 (seek(5000*10k), read(10k)), which points to the leaf node located in block 190.
4. Read block 190 (seek(190*10k), read(10k)) and extract the value of interest from it.
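
The read path above can be sketched in a few lines. This is only an illustration: the JSON node encoding, the `type`/`keys`/`children`/`items` field names, and the separator keys are assumptions chosen to keep the example short, and the block numbers are taken from the walk-through.

```python
import bisect
import json
import tempfile

BLOCK_SIZE = 10_000  # toy block size from this answer

def write_node(f, block_no, node):
    """Store a node as space-padded JSON in its block (toy encoding)."""
    data = json.dumps(node).encode()
    if len(data) > BLOCK_SIZE:
        raise ValueError("node does not fit in one block")
    f.seek(block_no * BLOCK_SIZE)
    f.write(data.ljust(BLOCK_SIZE, b" "))

def read_node(f, block_no):
    f.seek(block_no * BLOCK_SIZE)          # seek(block_no * 10k)
    return json.loads(f.read(BLOCK_SIZE))  # read(10k)

def lookup(f, key):
    """Walk from the root block to a leaf; return (value, blocks_read)."""
    node = read_node(f, 0)
    reads = 1
    while node["type"] == "branch":
        # keys[i] is the smallest key reachable through children[i + 1].
        i = bisect.bisect_right(node["keys"], key)
        node = read_node(f, node["children"][i])
        reads += 1
    return node["items"].get(key), reads

# Build the path root -> 900 -> 5000 -> 190 from the walk-through.
# Blocks 6, 7 and 8 are placeholders that this lookup never visits.
with tempfile.TemporaryFile() as f:
    write_node(f, 0, {"type": "branch", "keys": ["m"], "children": [7, 900]})
    write_node(f, 900, {"type": "branch", "keys": ["t"], "children": [5000, 8]})
    write_node(f, 5000, {"type": "branch", "keys": ["p"], "children": [6, 190]})
    write_node(f, 190, {"type": "leaf", "items": {"q": "value for q"}})
    value, reads = lookup(f, "q")
```

The number of block reads equals the height of the tree, which is why a lookup stays cheap even when the index is very large.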

A really large index can be split across multiple files; the address of a block then becomes something like (filename_id, address_relative_to_this_file).
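
One possible way to compute such an address, assuming every file holds the same number of blocks, is shown below. `BLOCKS_PER_FILE` and both function names are hypothetical and chosen only for this sketch.

```python
BLOCK_SIZE = 10_000          # toy block size from this answer
BLOCKS_PER_FILE = 1_000_000  # assumption: each file is 10 GB, as in the toy model

def to_address(global_block_no):
    """Map a global block number to (file_id, byte offset inside that file)."""
    file_id, local_block = divmod(global_block_no, BLOCKS_PER_FILE)
    return file_id, local_block * BLOCK_SIZE

def to_global(file_id, byte_offset):
    """Inverse of to_address for block-aligned offsets."""
    return file_id * BLOCKS_PER_FILE + byte_offset // BLOCK_SIZE
```

With these assumed sizes, global block 2,500,190 lives in the file with id 2 at byte offset 5,001,900,000.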

164

answered Sep 30 '22 11:09

valentin
