I noticed some strange behaviour:
Running this query:
SELECT TOP 5000 t1.f1,t1.f2,t1.f3
FROM t1
JOIN t2 on t1.f1 = t2.f1
WHERE t2.f1 IS NOT NULL AND (t1.f5 != t2.f3)
Results in 3447 rows in 2 seconds.
Running this one:
SELECT t1.f1,t1.f2,t1.f3
FROM t1
JOIN t2 on t1.f1 = t2.f1
WHERE t2.f1 IS NOT NULL AND (t1.f5 != t2.f3)
Runs forever until I stop it (at least 120 minutes!!).
Tables t1 and t2 each hold about 500k records.
I always assumed the TOP clause did not matter if the total number of rows lay below that number; however, there seems to be a very significant difference. Is this normal (and if so, why), or is this just a fluke?
EDIT:
As requested:
t1:
CREATE TABLE [dbo].[t1](
[f1] [int] NOT NULL,
[f2] [varchar](10) NULL,
[f3] [varchar](4) NULL,
[f4] [int] NOT NULL,
[f5] [varchar](max) NULL,
CONSTRAINT [PK_t1] PRIMARY KEY CLUSTERED
(
[f1] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]
t2:
CREATE TABLE [dbo].[t2](
[f1] [nchar](10) NOT NULL,
[f2] [nchar](10) NOT NULL,
[f3] [varchar](max) NOT NULL,
[f4] [nchar](10) NULL,
[f5] [date] NULL,
[f6] [date] NULL,
[f7] [nchar](1) NULL,
CONSTRAINT [PK_t2] PRIMARY KEY CLUSTERED
(
[f1] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]
Execution plans:
With top:
Without top:
Looking at this, I would have to conclude that the sorting (WHY is it doing that??) causes the delay... Would you agree?
Edit 2: As requested, the execution plan with the loops option, without TOP:
The problem is that your two tables, [t1] and [t2], have completely different (and largely incompatible) data types for the JOIN column, f1.
This makes it impossible for the Query Optimizer to generate an accurate estimate of how many rows will match between these two 500,000-row tables. It appears to be using a default "guess" that in this case is a gross over-estimate of the actual number (3447). Because of this, when you are not using TOP, it thinks it will be more efficient to Sort and then Merge the rows (O(N log N)) than to do Nested Loops (O(N^2)), because it does not realize that the (merge) JOIN will actually eliminate almost all of the rows.
When you have the TOP 5000 on, it realizes that Nested Loops are better, because the output will be cut off at no more than 5000 rows (far less than 500k^2, and even less than 500k * log(500k)). But unlike Nested Loops, the Merge-Sort cannot be done incrementally; it has to have all of the rows before the Sort can run. So cutting off the output at 5000 would not save much at all, which makes Nested Loops clearly the better option (even with the bad JOIN estimate).
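You can see the misestimate for yourself by comparing estimated and actual row counts. A minimal way to do that (the exact numbers will depend on your data and SQL Server version) is:
SET STATISTICS PROFILE ON;
SELECT t1.f1, t1.f2, t1.f3
FROM t1
JOIN t2 ON t1.f1 = t2.f1
WHERE t2.f1 IS NOT NULL AND (t1.f5 != t2.f3);
SET STATISTICS PROFILE OFF;
-- In the plan output, compare the Rows column (actual) to EstimateRows for the join operator.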
The root problem here is that the column t2.f1 is an NCHAR(10), which is a really bad choice for something that apparently is supposed to contain an integer. The best solution would be to change that column's data type to INT.
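A minimal sketch of that change, assuming every existing value in t2.f1 converts cleanly to an integer (the clustered primary key has to be dropped and re-created around the ALTER):
-- drop the PK that covers f1, change the type, then put the PK back
ALTER TABLE dbo.t2 DROP CONSTRAINT PK_t2;
ALTER TABLE dbo.t2 ALTER COLUMN f1 int NOT NULL;
ALTER TABLE dbo.t2 ADD CONSTRAINT PK_t2 PRIMARY KEY CLUSTERED (f1);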
If for some reason you cannot do that, then depending on your version of SQL Server, you may be able to end run this by adding a persisted computed column that calculates an INT converted value of [f1] and then throw a compatible index on that. This would allow both indexing and statistics to work again for queries like this.
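A sketch of that approach, assuming SQL Server 2012 or later for TRY_CONVERT (the column and index names here are only illustrative):
ALTER TABLE dbo.t2 ADD f1_int AS TRY_CONVERT(int, f1) PERSISTED;
CREATE INDEX IX_t2_f1_int ON dbo.t2 (f1_int);
-- The query can then join on matching types:
SELECT t1.f1, t1.f2, t1.f3
FROM t1
JOIN t2 ON t1.f1 = t2.f1_int
WHERE t2.f1_int IS NOT NULL AND (t1.f5 != t2.f3);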
As a last resort, you could also use a query hint. I do not normally recommend them, because they tend to be stopgap solutions that cause problems later on. However, if you felt this was your only choice, then adding OPTION (FAST 1000) to the end of your query would probably work.
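For example, applied to the query from the question (the hint just tells the optimizer to plan for the first 1000 rows quickly, much as TOP does):
SELECT t1.f1, t1.f2, t1.f3
FROM t1
JOIN t2 ON t1.f1 = t2.f1
WHERE t2.f1 IS NOT NULL AND (t1.f5 != t2.f3)
OPTION (FAST 1000);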
SQL queries can be optimized in many different ways. Two common goals are "fastest first row" and "fastest last row": that is, do you want to minimize the time to get any result, or the time to get the complete result set?
I would guess that these two versions are being optimized differently. You can check this, as Aaron suggests, by looking at the execution plans. My usual bet is that the slow version uses nested loop joins. You can fix this with an optimizer hint, such as:
<your query>
OPTION (MERGE JOIN, HASH JOIN)
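Applied to the query from the question, that would look like the following (whether the optimizer then picks a merge or hash join depends on the data):
SELECT t1.f1, t1.f2, t1.f3
FROM t1
JOIN t2 ON t1.f1 = t2.f1
WHERE t2.f1 IS NOT NULL AND (t1.f5 != t2.f3)
OPTION (MERGE JOIN, HASH JOIN);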
There are other possibilities. Perhaps these tables are being updated, and they happened to have full table locks when you ran the second query. You can check this using sp_who2.
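For example (the BlkBy column in the output shows which session, if any, is blocking another):
EXEC sp_who2;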