
Is there any general rule on SQL query complexity Vs performance?

Tags:

performance

big-o

sql

1) Are SQL query execution times O(n) in the number of joins, if indexes are not used? If not, what kind of relationship are we likely to expect? And can indexing improve the actual big-O time complexity, or does it only reduce the entire query time by some constant factor?

Slightly vague question, I'm sure it varies a lot but I'm talking in a general sense here.

2) If you have a query like:

    SELECT  T1.name, T2.date
    FROM    T1, T2
    WHERE   T1.id=T2.id
            AND T1.color='red'
            AND T2.type='CAR'

Am I right in assuming the DB will do single-table filtering first on T1.color and T2.type, before evaluating multi-table conditions? In such a case, making the query more complex could make it faster because fewer rows are subjected to the join-level tests?

735

asked Jan 14 '10 16:01

Mr. Boy

2 Answers

This depends on the query plan used.

Even without indexes, modern servers can use HASH JOIN and MERGE JOIN, which are faster than O(N * M).

More specifically, the complexity of a HASH JOIN is O(N + M), where N is the number of rows in the hashed table and M is the number of rows in the lookup table. Hashing and hash lookups have constant complexity.

The complexity of a MERGE JOIN is O(N*Log(N) + M*Log(M)): it's the sum of the times to sort both tables plus the time to scan them. The scan itself is only O(N + M), so the sorts dominate.

    SELECT  T1.name, T2.date
    FROM    T1, T2
    WHERE   T1.id=T2.id
            AND T1.color='red'
            AND T2.type='CAR'

If there are no indexes defined, an engine that supports these algorithms will typically select either a HASH JOIN or a MERGE JOIN.

The HASH JOIN works as follows:

1. The hashed table is chosen (usually the table with fewer records). Say it's t1.
2. All records from t1 are scanned. If a record has color='red', it goes into the hash table with id as the key and name as the value.
3. All records from t2 are scanned. If a record has type='CAR', its id is looked up in the hash table, and the values of name from all hash hits are returned along with the current value of date.
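
The three steps above can be sketched in Python. This is only an illustration under stated assumptions, not the code of any real engine: tables are plain lists of dicts, the function name `hash_join` is hypothetical, and the sample rows are invented for the example.

```python
from collections import defaultdict

def hash_join(t1, t2):
    """Illustrative hash join for the example query (not any engine's actual code)."""
    # Build phase: scan t1 once, keep only red rows, key the hash table on id.
    names_by_id = defaultdict(list)
    for row in t1:
        if row["color"] == "red":
            names_by_id[row["id"]].append(row["name"])

    # Probe phase: scan t2 once, keep only CAR rows, look each id up in the hash table.
    result = []
    for row in t2:
        if row["type"] == "CAR":
            for name in names_by_id.get(row["id"], []):
                result.append((name, row["date"]))
    return result

# Hypothetical sample data, invented for this sketch.
t1 = [
    {"id": 1, "name": "apple", "color": "red"},
    {"id": 2, "name": "sky", "color": "blue"},
    {"id": 3, "name": "rose", "color": "red"},
]
t2 = [
    {"id": 1, "date": "2010-01-14", "type": "CAR"},
    {"id": 3, "date": "2010-01-15", "type": "BIKE"},
    {"id": 3, "date": "2010-01-16", "type": "CAR"},
]

rows = hash_join(t1, t2)
print(rows)
```

Each table is scanned exactly once and every lookup is a constant-time dictionary access on average, which is where the O(N + M) figure comes from.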

The MERGE JOIN works as follows:

1. A copy of t1 (id, name) is created, sorted on id.
2. A copy of t2 (id, date) is created, sorted on id.
3. The pointers are set to the minimal values in both tables:

```
>1  2<
 2  3
 2  4
 3  5
```

4. The pointers are compared in a loop, and if they match, the records are returned. If they don't match, the pointer with the minimal value is advanced:

    >1  2<   no match; the left value is smaller, so advance the left pointer
     2  3
     2  4
     3  5

     1  2<   match; return the records for every row with this id, then advance both pointers past it
    >2  3
     2  4
     3  5

     1  2    match; return the records and advance both pointers
     2  3<
     2  4
    >3  5

     1  2    the left pointer is out of range, so the join is finished
     2  3
     2  4<
     3  5
    >

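The pointer walk above can be sketched in Python as follows. It is a minimal sketch under stated assumptions, not an engine's implementation: rows are `(id, value)` tuples, the name `merge_join` is hypothetical, and the values attached to the ids are invented. Runs of equal ids are handled as a group, so both left rows with id 2 are paired with the right row with id 2.

```python
def merge_join(left, right):
    """Illustrative sort-merge join on the first element of each tuple."""
    left = sorted(left, key=lambda r: r[0])    # O(N log N)
    right = sorted(right, key=lambda r: r[0])  # O(M log M)
    i = j = 0
    out = []
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1          # left value is smaller: advance the left pointer
        elif lk > rk:
            j += 1          # right value is smaller: advance the right pointer
        else:
            # Match: find the end of the run of equal keys on each side.
            i_end = i
            while i_end < len(left) and left[i_end][0] == lk:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            # Return every pairing within the two runs, then skip past them.
            for l in left[i:i_end]:
                for r in right[j:j_end]:
                    out.append((l[1], r[1]))
            i, j = i_end, j_end
    return out

# Same ids as the walkthrough above; the attached values are invented.
t1 = [(1, "a"), (2, "b"), (2, "c"), (3, "d")]
t2 = [(2, "w"), (3, "x"), (4, "y"), (5, "z")]
joined = merge_join(t1, t2)
print(joined)
```

The loop stops as soon as either pointer runs off the end of its table, which matches the last step of the walkthrough.
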
> In such a case, making the query more complex could make it faster because less rows are subjected to the join-level tests?

Sure.

Your query without the WHERE clause:

    SELECT  T1.name, T2.date
    FROM    T1, T2

is simpler, but it returns every combination of rows from the two tables (N * M of them) and runs longer.
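
To see the difference in result size, here is a small sketch using Python's built-in `sqlite3` module with an in-memory database. The table contents are invented for the example, and SQLite is only a convenient stand-in; the row counts, not the timings, are the point.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE T1 (id INTEGER, name TEXT, color TEXT);
    CREATE TABLE T2 (id INTEGER, date TEXT, type TEXT);
    INSERT INTO T1 VALUES (1, 'apple', 'red'), (2, 'sky', 'blue'), (3, 'rose', 'red');
    INSERT INTO T2 VALUES (1, '2010-01-14', 'CAR'), (3, '2010-01-15', 'BIKE'),
                          (3, '2010-01-16', 'CAR'), (4, '2010-01-17', 'CAR');
""")

# No WHERE clause: every row of T1 is paired with every row of T2.
unfiltered = conn.execute("SELECT T1.name, T2.date FROM T1, T2").fetchall()

# The original query: join condition plus the two single-table filters.
filtered = conn.execute(
    "SELECT T1.name, T2.date FROM T1, T2 "
    "WHERE T1.id = T2.id AND T1.color = 'red' AND T2.type = 'CAR'"
).fetchall()

print(len(unfiltered))  # 3 rows x 4 rows = 12 combinations
print(len(filtered))
conn.close()
```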

105

answered Sep 20 '22 04:09

Quassnoi

Be careful of conflating too many different things. You have a logical cost of the query based on the number of rows to be examined, a (possibly) smaller logical cost based on the number of rows actually returned, and a separate physical cost based on the number of pages that have to be examined.

The three are related, but not strongly.

The number of rows examined is the largest of these counts and the hardest to control, because the rows have to be matched through the join algorithm. It is also the least relevant.

The number of rows returned is more costly because that's I/O bandwidth between client application and database.

The number of pages read is the most costly because it translates into an even larger number of physical I/Os. That load sits inside the database and affects all clients.

A SQL query against one table is O( n ), where n is the number of rows. It's also O( p ), where p is the number of pages.

With more than one table, the number of rows examined is O(nm...) under the nested-loops algorithm. Depending on the cardinality of the relationships, however, the result set may be as small as O( n ), for example when the relationships are all 1:1. Either way, each table must be examined for matching rows.
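
A minimal Python sketch of the nested-loops case, assuming two in-memory lists of `(id, value)` tuples with a 1:1 relationship on id (the function name and the data are hypothetical). It counts comparisons to show that n * m rows are examined even though only about n rows come back.

```python
def nested_loop_join(left, right):
    """Illustrative nested-loops join that also counts row comparisons."""
    comparisons = 0
    out = []
    for l in left:
        for r in right:
            comparisons += 1          # every pair of rows is examined
            if l[0] == r[0]:
                out.append((l, r))
    return out, comparisons

n, m = 200, 300
left = [(i, f"left-{i}") for i in range(n)]
right = [(i, f"right-{i}") for i in range(m)]

matches, comparisons = nested_loop_join(left, right)
print(comparisons, len(matches))
```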

A Hash Join replaces O( n*log(n) ) index + table reads with O( n ) direct hash lookups. You still have to process O( n ) rows, but you bypass some index reads.

A Merge Join replaces the O( nm ) nested loops with an O( (n+m) log(n+m) ) sort operation.

With indexes, the physical cost may be reduced to O(m log(n)) if a table is merely checked for existence. If rows are required, then the index speeds access to the rows, but all matching rows must still be processed. That can be O(nm), because that's the size of the result set, irrespective of indexes.
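
The existence-check case can be sketched in Python with a sorted list standing in for an index. This is an assumption made for illustration: real indexes are usually B-trees on disk, and the names and data here are hypothetical. Each probe is a binary search costing O(log n), so m probes cost O(m log n).

```python
from bisect import bisect_left

def exists_in_index(index, key):
    """Binary search in a sorted key list, standing in for an index probe."""
    pos = bisect_left(index, key)
    return pos < len(index) and index[pos] == key

# Hypothetical data: an "index" over n keys and m ids to check against it.
index = sorted(range(0, 2000, 2))   # n = 1000 even ids
probes = [3, 4, 10, 1999, 1998]     # m = 5 ids to look up

found = [key for key in probes if exists_in_index(index, key)]
print(found)
```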

The pages examined for this work may be smaller, depending on the selectivity of the index.

The point of an index isn't to reduce the number of rows examined so much. It's to reduce the physical I/O cost of fetching the rows.

answered Sep 23 '22 04:09

S.Lott
