Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Hash Join vs. Nested Loop

I'm new to SQL and I have a question.

I have this SQL code:

DROP TABLE if exists students;
DROP TABLE if exists grades;

CREATE TABLE students(
    s_id integer NOT NULL PRIMARY KEY,
    s_name text,
    s_last_name text,
    curr_year integer
);

CREATE TABLE grades(
    s_id integer NOT NULL PRIMARY KEY,
    course text,
    c_year integer,
    grade integer,
    FOREIGN KEY (s_id) REFERENCES students
);

INSERT INTO students (s_id, s_name, s_last_name, curr_year)
VALUES (1, 'A', 'S', 3);

INSERT INTO students (s_id, s_name, s_last_name, curr_year)
VALUES (2, 'A', 'A', 2);

INSERT INTO students (s_id, s_name, s_last_name, curr_year)
VALUES (3, 'V', 'B', 1);

INSERT INTO students (s_id, s_name, s_last_name, curr_year)
VALUES (4, 'K', 'N', 2);

INSERT INTO grades (s_id, course, c_year, grade)
VALUES (1, 'DB', 2, 98);

INSERT INTO grades (s_id, course, c_year, grade)
VALUES (2, 'OS', 3, 90);

INSERT INTO grades (s_id, course, c_year, grade)
VALUES (3, 'DB', 2, 94);

EXPLAIN ANALYZE
    SELECT * 
    FROM students s JOIN grades gr
    ON s.s_id = gr.s_id
    WHERE curr_year > 0;



CREATE INDEX student_details ON students(s_id, s_name, s_last_name);
CREATE INDEX student_courses ON grades(s_id, course);

EXPLAIN ANALYZE
    SELECT * 
    FROM students s JOIN grades gr
    ON s.s_id = gr.s_id
    WHERE curr_year > 0;


DROP INDEX student_details;
DROP INDEX student_courses;
DROP TABLE students CASCADE;
DROP TABLE grades CASCADE;

And I try to understand the explain outputs. Before the INDEX creating I got hash merge. Here is the explain output:

 Hash Join  (cost=23.50..51.74 rows=270 width=116) (actual time=0.039..0.050 rows=3 loops=1)
   Hash Cond: (gr.s_id = s.s_id)
   ->  Seq Scan on grades gr  (cost=0.00..21.30 rows=1130 width=44) (actual time=0.005..0.008 rows=3 loops=1)
   ->  Hash  (cost=20.12..20.12 rows=270 width=72) (actual time=0.021..0.021 rows=4 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 9kB
         ->  Seq Scan on students s  (cost=0.00..20.12 rows=270 width=72) (actual time=0.006..0.011 rows=4 loops=1)
               Filter: (curr_year > 0)
 Planning time: 0.147 ms
 Execution time: 0.089 ms
(9 rows)

And after the INDEX creating I get Nested Loop:

Nested Loop  (cost=0.00..2.12 rows=1 width=116) (actual time=0.031..0.116 rows=3 loops=1)
   Join Filter: (s.s_id = gr.s_id)
   Rows Removed by Join Filter: 9
   ->  Seq Scan on students s  (cost=0.00..1.05 rows=1 width=72) (actual time=0.012..0.018 rows=4 loops=1)
         Filter: (curr_year > 0)
   ->  Seq Scan on grades gr  (cost=0.00..1.03 rows=3 width=44) (actual time=0.003..0.009 rows=3 loops=4)
 Planning time: 0.396 ms
 Execution time: 0.170 ms

But I can't figure out why? Why the indexing in my case made Nested Loop be preferred over Hash Join? I'll be very happy to get an explanation for that.

Thanks a lot!

like image 729
Vipasana Avatar asked Jan 15 '18 15:01

Vipasana


People also ask

What are the differences between hash join merge join and nested loops?

Nested Loops are used to join smaller tables. Further, nested loop join uses during the cross join and table variables. Merge Joins are used to join sorted tables. This means that Merge joins are utilized when join columns are indexed in both tables while Hash Match join uses a hash table to join equi joins.

What is the difference between merge join and hash join?

Merge join is used when projections of the joined tables are sorted on the join columns. Merge joins are faster and uses less memory than hash joins. Hash join is used when projections of the joined tables are not already sorted on the join columns.

Why is hash join faster?

The HASH join might be faster than a SORT-MERGE join, in this case, because only one row source needs to be sorted, and it could possibly be faster than a NESTED LOOPS join because probing a hash table in memory can be faster than traversing a b-tree index.

Is hash join good?

Hash join is best algorithm when large, unsorted, and non-indexed data (residing in tables) is to be joined. Hash join algorithm consists of probe phase and build phase.


1 Answers

"Nested loop" join is a bit of a misnomer. Technically it refers to a nested loop -- as its name implies:

for every row in table1
    for every row in table2
        compare values and execute join

In practice, this often works as:

for every row in table1
    for every matching row in table2
        execute join

The difference is subtle, but the "matching" means that the nested loop join can make use of an index. So, a nested loop join can have very poor performance (if the tables are relatively large and there are no indexes) or it can have really good performance (if it can make use of an index).

like image 119
Gordon Linoff Avatar answered Oct 14 '22 08:10

Gordon Linoff