Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How Database stores data internally in B-Tree/B+Tree

My question is that How database stores data and how it performs query internally.

Suppose we have following fields in our table:

  1. ID
  2. Name
  3. Age
  4. Weight
  5. Manager

and we query select * from Table1 where age>50 and weight<100

I am just curious that how it perform query internally.

What will the Node of B-Tre/B+Tree contains in this example?

like image 631
ZeNo Avatar asked Oct 30 '10 17:10

ZeNo


People also ask

How do databases store data internally?

Well, data in tables is stored in row and column format at the logical level, but physically it stores data in something called data pages. A data page is the fundamental unit of data storage in SQL Server and it is 8KB in size.

How does DB store tree structure?

The standard method of storing hierarchical data is simple parent-child relationship. Each record in the database includes a —parent id—, and a recursive query through the records build the children, siblings, and levels of the tree.

How do databases use B trees?

A B-tree is a tree data structure that keeps data sorted and allows searches, insertions, and deletions in logarithmic amortized time. Unlike self-balancing binary search trees, it is optimized for systems that read and write large blocks of data. It is most commonly used in database and file systems.

How does a relational database store data internally?

A relational database stores data in tables. Tables are organized into columns, and each column stores one type of data (integer, real number, character strings, date, "¦). The data for a single "instance" of a table is stored as a row.


1 Answers

The example you have chosen is one of the few cases where a single Tree can't do the job (two independent ranges).

However, the first chapter of my work-in-progress e-Book explains the inner workings of B-Tree indexes: http://use-the-index-luke.com/anatomy/

EDIT for more details why two indexes might be useful for the above example.

The above query can be supported by three possible index configurations:

  1. concatenated index on AGE and then WEIGHT (in this order).
    In case, the query would read all records WHERE AGE > 50 and then filter by WEIGHT.

  2. concatenated index on WEIGHT and then AGE (the other order).
    That goes the different way: read all records WHERE WEIGHT < 100 and then filter by AGE.

Whatever is more efficient depends on the data you have. If there are less records AGE > 50 than WEIGHT < 100 the first will be more efficient, otherwise the second. However, if you query with different values, the picture might change.

The reason that a concatenated index can't support the query well is that each index order is on one axis only. each index entry is before or after another one, but never next to it. All index entries build one chain.

A query that has two independent range queries would require two axes, not like a chain, but more like a chess board. one axis for AGE the other for WEIGHT. If that would be possible, your query would need to scan only one corner of the chess board.

However, a b-tree has only one axis, hence you must chose which criteria to use first. If you chose AGE it means that starting with AGE 50, the entire chain will be scanned until the end. Only some of the records stored at the side of the chain will also qualify for WEIGHT < 100, the other records must be read but will be discarded.

So, a long story to explain why one tree can not support a query with two independent range clauses. On the other hand, one concatenated index can do the following quite well:

WHERE age = 50 AND weight < 100
WHERE weight = 100 AND age > 50
WHERE age > 50 AND age < 70;

However, the problem arises if there are two inequality operators are used on two different columns.

So, what to do?

The third possible approach is to have two independent indexes on the two columns. That allows to have as many axes as you like (just create more indexes). However, there are a few huge problems with that. First of all, not all DB products support that. Whenever it is supported, it is a rather expansive operation. It works typically that way that each index is scanned, a bitmap index is built for each result. Those bitmap indexes are then joined to apply the AND operator. That's a lot of data munging--it is only worth the effort if each condition is not very selective for it's own, but both together are very selective.

Wan't my advice?

If your query runs in an OLTP environment: use one concatenated index. Two independent indexes are an option of last resort only. However, if you are working in an OLAP environment, you might anyways need bitmap indexes.

ps.: Indexing AGE was an exercise in my book (with solution)--especially because storing AGE is a bad practice, you should store the date of birth instead.

like image 169
Markus Winand Avatar answered Oct 31 '22 18:10

Markus Winand