Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How does columnar Databases do indexing?

Tags:

database

mysql

I understand that the columnar databases put column data together on the disk rather than rows. I also understand that in traditional row-wise RDBMS, leaf index node of B-Tree contains pointer to the actual row.

But since columnar doesn't store rows together, and they are particularly designed for columnar operations, how do they differ in the indexing techniques?

  1. Do they also use B-tress?
  2. How do they index inside whatever datastructure they use?
  3. Or there is no accepted format, every vendor have their own indexing scheme to cater their needs?

I have been searching, but unable to find any text. Every text I found is for row-wise DBMS.

like image 692
Mangat Rai Modi Avatar asked Feb 08 '18 16:02

Mangat Rai Modi


Video Answer


3 Answers

If you understand 1)How columnar DBs store the data actually, and 2)How Indexes work, (how they store the data) Then you may feel that there is no need of indexing in columnar Dbs.

For any kind of database rowid is very important, it is like the address where the data is stored. Indexing is nothing but, mapping the rowids to the columns that are being indexed in a sorted order. Columnar databases are born basing this logic. They try to store the data in this fashion itself, meaning - They store the data as a key-value pair in a serialized manner where the actual column value is Key and the rowid when the data is residing as its value and if they find any duplicates for a key, they just compress and store.

So if you compare the format how columnar databases store the data actually on Disk, it is almost the same (but not exactly because, as the difference is compression, representation of key-value in a vice versa fashion) how the row oriented databases store indexes.

That's the reason you don't need separate indexing again. and you won't find any columnar database trying to implement indexing.

like image 194
santhosh Avatar answered Oct 19 '22 23:10

santhosh


There are no BTrees. (Or, if they are, they are not the main part of the design)

Infinidb stores 64K rows per chunk. Each column in that chunk is compressed and indexed. With the chunk is a list of things like min, max, avg, etc, for each column that may or may not help in queries.

Running a SELECT first looks at that summary info for each chunk to see if the WHERE clause might be satisfied by any of the rows in the chunk.

The chunks that pass that filtering get looked at in more detail.

There is no copy of a row. Instead, if, say, you ask for SELECT a,b,c, then the compressed info for 64K rows (in one chunk) for each of a, b, c need to be decompressed to further filter and deliver the row. So, it behooves you to list only the desired columns, not blindly say SELECT *.

Since every column is separately indexed all the time, there is no need to say INDEX(a). (I don't know if INDEX(a,b) can even be specified for a columnar DB.)

Caveat: I am describing Infinidb, which is available with MariaDB. I don't know about any other columnar engines.

like image 23
Rick James Avatar answered Oct 19 '22 23:10

Rick James


Columnar Indexes (also known as "vertical data storage") stores data in a hash and compressed mode. All columns invoked in the index key are indexed separately. Hashing decrease the volume of data stored. The compressing method use only one value for repetitive occurrences (dictionnary, eventually partial).

This technic have two major difficulties :

  • First you can have collision, because a hash result can be the same for two distinct values. So the index must manage collisions.
  • Second, the hash and compress algorithms used is a very heavy consumer of resources like CPU.

Those type of indexes are stored as vectors.

Ordinary, those type of indexes are used only for read only tables, especially for the business intelligence (OLAP databases).

A columnar index can be used in a "seekable" way only for an equality predicate (COLUMN_A = OneValue). But it is also adequate for GROUPING or DISTINCT operations. Columnar index does not support range seek, including the LIKE 'foo%'.

Some database vendors have get around the huge resources needed while inserting or updating by adding some intermediate algorithms that decrease the CPU. This is the case for Microsoft SQL Server that use a delta store for newly modified rows. With this technic, the table can be used in a relational way like any classical OLTP dataabase.

For instance, Microsoft SQL Server introduced first the columnstore index in 2012 version, but this made the table read only. In 2014 the clustered columnstore index (all the columns of the table was indexed) was released and the table was writetable. And finally in the 2016 version, the columnstore index clustered ornot, no longer demands any part of the table to be read only. This was made possible because a particular search algorithm, named "Batch Mode" was developed by Microsoft Research, and does not works by reading the data row by row...

To read :

Enhancements to SQL Server Column Stores

Columnstore and B+ tree –Are Hybrid Physical Designs Important?

like image 44
SQLpro Avatar answered Oct 19 '22 22:10

SQLpro