Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Should Join Tables typically be created as Index Organized Tables (Clustered Indexes)?

Generally speaking ... should join tables (i.e. associative tables) be created as Index Organized Tables (Oracle) , Clustered Indexes (SQL Server) .... or plain old heap tables (with separate indexes on the 2 columns).

The way I see if, the advantages are:

Speed improvement. You're avoiding a heap table look up.

Space Improvement. You're eliminating the heap table altogether, so you're probably saving ~30% space.

The disadvantages:

Index Skip Scan (only applies to Oracle) .. will be faster then a Full Table Scan, but slower then an Index Scan. So searches on the second column of the compound key will be slightly slower (Oracle), much slower (MSSQL).

A Full Index Scan will be slower then a Full Table Scan - so if most of the time the Cost Based Optimizer is doing Hash Joins (which don't take advantage of Indexes) ... you could expect worse performance. (Assuming that the RDBMS doesn't first filter the tables).

Which makes me question whether any type of indexes are really required for Join Tables, if you're predominately going to be doing Hash Joins.

like image 913
vicsz Avatar asked Jan 01 '12 21:01

vicsz


People also ask

Are indexes good for joins?

Indexes can help improve the performance of a nested-loop join in several ways. The biggest benefit often comes when you have a clustered index on the joining column in one of the tables. The presence of a clustered index on a join column frequently determines which table SQL Server chooses as the inner table.

Why only one clustered index can be created on a table?

There can be only one clustered index per table, because the data rows themselves can be stored in only one order. The only time the data rows in a table are stored in sorted order is when the table contains a clustered index. When a table has a clustered index, the table is called a clustered table.

What is clustered index organization?

A clustered index is a type of index where the table records are physically re-ordered to match the index. Clustered indexes are efficient on columns that are searched for a range of values.


2 Answers

My personal rule-of-thumb is to create two-table associative entities as index-organized-tables, with the primary key constraint being the access "direction" I expect to be more commonly used. I'll then generally add a unique index to cover reverse order of the keys, so in all cases the optimizer should be able to use unique-scan or range-scan access.

Three-table (or more) associative entities generally require significantly more analysis.

Also, the optimizer will use indexes with hash join operations; generally fast full scans, but indexes nonetheless.

like image 166
Adam Musch Avatar answered Sep 20 '22 16:09

Adam Musch


I'd just list and talk through a few possible solutions, which hopefully will help you decide. A "union table" contains two or three columns. A foreign key to the left table, say a, and a foreign key to the right table, say b. The optional column is the row identity for the "union table", say id.

Solution 1: Columns a,b. No clustered index (a heap), indexes on (a,b) and (b,a)
Both columns are stored in three places. It supports seeks on both a and b, and the seek for b does not require a bookmark lookup, since a part of the (b,a) index. Decent choice, but the triple storage seems like a waste. The heap has no use but has to be maintained during insert and update queries.

Solution 2: Columns a, b. Clustered index on (a,b), index on (b,a)
All data is stored twice. Can serve seeks on a and b without a bookmark lookup. This would be the best practice approach. It trades disk storage for speed.

Solution 3: Columns a, b. Clustered index on (a,b)
All data is stored only once. It can serve a seek on a, but not on b. Going from the right to the left table will require a table scan. This trades speed for disk space. (Your question mentions hash join. A hash join always does a full scan.)

Solution 4: Columns id, a, b. Clustered index (id), index on (a) and (b)
Seeks on a or b both require a bookmark lookup. Both a and b are stored twice on disk, once in their own index and once in the clustered key. This is the worst solution I could think of.

This list is by no means exhaustive. Solution 2 would be a good default choice. I'd go for that unless another solution proved itself to be significantly better in tests.

like image 35
Andomar Avatar answered Sep 18 '22 16:09

Andomar