Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Indexes on join tables

When searching on Google for join table indexes, I got this question.

Now, I believe that it is giving some false information in the accepted answer, or I do not understand how everything works. Given the following tables (running on PostGreSQL 9.4):

CREATE TABLE "albums" ("album_id" serial PRIMARY KEY, "album_name" text)
CREATE TABLE "artists" ("artist_id" serial PRIMARY KEY, "artist_name" text)
CREATE TABLE "albums_artists" ("album_id" integer REFERENCES "albums", "artist_id" integer REFERENCES "artists")

I was trying to replicate the scenario from the question mentioned above, by creating first an index on both of the columns of the albums_artists table and then one index for each column (without keeping the index on both columns).

I would have been expecting very different results when using the EXPLAIN command for a normal, traditional select like the following one:

SELECT "artists".* FROM "test"."artists"
    INNER JOIN "test"."albums_artists" ON ("albums_artists"."artist_id" = "artists"."artist_id")
    WHERE ("albums_artists"."album_id" = 1)

However, when actually running explain on it, I get exactly the same result for each of the cases (with one index on each column vs. one index on both columns).

I've been reading the documentation on PostGreSQL about indexing and it doesn't make any sense on the results that I am getting:

Hash Join  (cost=15.05..42.07 rows=11 width=36) (actual time=0.024..0.025 rows=1 loops=1)
  Hash Cond: (artists.artist_id = albums_artists.artist_id)
  ->  Seq Scan on artists  (cost=0.00..22.30 rows=1230 width=36) (actual time=0.006..0.006 rows=1 loops=1)
  ->  Hash  (cost=14.91..14.91 rows=11 width=4) (actual time=0.009..0.009 rows=1 loops=1)
        Buckets: 1024  Batches: 1  Memory Usage: 1kB
        ->  Bitmap Heap Scan on albums_artists  (cost=4.24..14.91 rows=11 width=4) (actual time=0.008..0.009 rows=1 loops=1)
              Recheck Cond: (album_id = 1)
              Heap Blocks: exact=1
              ->  Bitmap Index Scan on albums_artists_album_id_index  (cost=0.00..4.24 rows=11 width=0) (actual time=0.005..0.005 rows=1 loops=1)
                    Index Cond: (album_id = 1)

I would expect to not get an Index scan at the last step when using an index composed by 2 different columns (since I am only using one of them in the WHERE clause).

I was about to open a bug in an ORM library that adds one index for both columns for join tables, but now I am not so sure. Can anyone help me understand why is the behavior similar in the two cases and what would actually be the difference, if there is any?

like image 396
Florin Asăvoaie Avatar asked Dec 05 '15 22:12

Florin Asăvoaie


People also ask

Do indexes work on joins?

Indexes can help improve the performance of a nested-loop join in several ways. The biggest benefit often comes when you have a clustered index on the joining column in one of the tables. The presence of a clustered index on a join column frequently determines which table SQL Server chooses as the inner table.

How do you create an index in joins?

A star-join index is one in which a single table at the center of the star is joined to multiple tables in a one-to-many relationship. To define a star-join index, you must define single-column key and primary keys, and then use the key join syntax in the CREATE JOIN INDEX statement.

What is index in joining?

Advertisements. JOIN INDEX is a materialized view. Its definition is permanently stored and the data is updated whenever the base tables referred in the join index is updated. JOIN INDEX may contain one or more tables and also contain pre-aggregated data. Join indexes are mainly used for improving the performance.

Does index improve join performance?

Indexes slows performance of mass inserts, it is usually advised to drop and index before you load a table and build it after the load is complete. In some databases, index structure can get corrupted. Index performance depends on key data type and length.


1 Answers

  • add a NOT NULL constraint on the key columns (a tuple with NULLs would make no sense here)
  • add a PRIMARY KEY (forcing a UNIQUE index on the two keyfields)
  • As a suport for FK lookups : add a compound index for the PK fields in reversed order
  • after creating/adding PKs and indexes, you may want to ANALYZE the table (only key columns have statistics)

CREATE TABLE albums_artists
    ( album_id integer NOT NULL REFERENCES albums (album_id)
    , artist_id integer NOT NULL REFERENCES artists (artist_id)
    , PRIMARY KEY (album_id, artist_id)
    );

CREATE UNIQUE INDEX ON albums_artists (artist_id, album_id);

The reason behind the observed behaviour is the fact that the planner/optimiser is information based, driven by heuristics. Without any information about the fraction of rows that will actually be needed given the conditions, or the fraction of rows that actually maches (in case of a JOIN), the planner makes a guess: (for example: 10% for a range query). For a small query, a hash join will always be a winning scenario, it does imply fetching all tuples from both tables, but the join itself is very efficient.

For columns that are part of a key or index, statistics will be collected, so the planner can make more realistic estimates about the amount of rows involved. Ald that will often result in an indexed plan, since that could need fewer pages to be fetched.

Foreign keys are a very special case; since the planner will know that all the values from the referring table will be present in the referred table. (that is 100%, assuming NOT NULL)

like image 98
wildplasser Avatar answered Sep 28 '22 02:09

wildplasser