
Big-Oh Performance of an Inner Join on Two Indexes

I am trying to figure out the Big-Oh performance of the following query:

SELECT * 
FROM table1 INNER JOIN table2 ON table1.a = table2.b
GROUP BY table1.a

table1.a is the primary key of the table. table2.b has a non-unique index on it.
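
For concreteness, the setup described would look roughly like this (MySQL syntax; the INT column types and the index name are assumptions for illustration, not part of the question):

CREATE TABLE table1 (
    a INT PRIMARY KEY              -- unique primary-key index on a
);

CREATE TABLE table2 (
    b INT,
    INDEX idx_table2_b (b)         -- non-unique secondary index on b
);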

My thought is that since each index can be searched in O(log n), this query performs in O(log n * log m), where n is the number of rows in table1 and m is the number of rows in table2.

Any input would be appreciated.

asked May 16 '13 by tlovett1

People also ask

Do indexes improve join performance?

Indexes can help improve the performance of a nested-loop join in several ways. The biggest benefit often comes when you have a clustered index on the joining column in one of the tables. The presence of a clustered index on a join column frequently determines which table SQL Server chooses as the inner table.

What is the time complexity of inner join?

If both tables are already sorted according to the keys being used in the join, then the query will have a time complexity of O(M + N). If both tables have an index on the joined columns, the index already maintains those columns in order and there is no need to sort, so the complexity is again O(M + N).

What is the most efficient way of joining two tables in the same database?

Relational algebra is the most common way of writing a query and also the most natural way to do so. The code is clean, easy to troubleshoot, and, unsurprisingly, it is also the most efficient way to join two tables.


2 Answers

Your thinking is a bit off. An index can be searched in O(log n) for a single lookup, but your query would presumably be doing "n" or "m" such lookups.

Let me assume that the query is processed by joining the two tables together, scanning one table and looking up the values in the other, and then doing sort-based aggregation for the GROUP BY.

The "matching" piece of the query is then the larger of:

  • O(n log m)
  • O(m log n)

This assumes that the query engine decides to scan one table and look up values in the index in the other.
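
To make the per-row cost concrete: under this plan, for each of the n rows scanned from table1 the engine effectively performs a point lookup against the index on table2.b, conceptually like the statement below (the literal 42 merely stands in for the current table1.a value and is only an illustration):

SELECT * FROM table2 WHERE table2.b = 42

Each such lookup is an O(log m) index seek, which is where the O(n log m) term above comes from.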

To continue: once you look up values, you need to fetch the data from the pages of the table where you used the index. Fetching data is technically O(n). Our estimate so far is O((n log m) + n).

The aggregation should be O(n log n) for a sort followed by a scan. But, how many records do you have for the aggregation? You could have as many as n*m matches to the join. Or, as few as 0 (it is an inner join).

This is big-O, which is an upper bound. So, we have to use the bigger estimate. This results in O((n*m)log(n*m)) for the aggregation, which would dominate other terms. The big-O would be O((n*m) log(n*m)).
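
Putting the pieces of this answer together (index nested-loop matching, row fetches, then a sort-based GROUP BY over up to n*m joined rows, per the worst-case assumption above):

    O(n log m)           -- probe table2's index once per row of table1
  + O(n)                 -- fetch the matched rows from the table pages
  + O((n*m) log(n*m))    -- sort up to n*m join results for the aggregation
  = O((n*m) log(n*m))    -- the aggregation term dominates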

answered Oct 19 '22 by Gordon Linoff

The performance of the query depends on how the SQL statement is executed internally.

Maybe you could look into EXPLAIN (for MySQL: http://dev.mysql.com/doc/refman/5.1/en/explain.html) to get more information on how your query actually gets executed, as this can yield more accurate results than reasoning about Big-Oh.
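
For example, you could prefix the original query with EXPLAIN (the exact output depends on your MySQL version and table statistics):

EXPLAIN
SELECT * 
FROM table1 INNER JOIN table2 ON table1.a = table2.b
GROUP BY table1.a

The type, key, and rows columns of the result show whether each table is scanned or accessed through the primary key or the index on b, i.e. which of the cost estimates above actually applies to your data.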

Btw: Gordon Linoff's answer looks good if you're really looking for Big-Oh!

answered Oct 19 '22 by lorey