I noticed some strange behaviour:
Running this query:
SELECT TOP 5000 t1.f1,t1.f2,t1.f3
FROM t1
JOIN t2 on t1.f1 = t2.f1
WHERE t2.f1 IS NOT NULL AND (t1.f5 != t2.f3)
Results in 3447 rows in 2 seconds.
Running this one:
SELECT t1.f1,t1.f2,t1.f3
FROM t1
JOIN t2 on t1.f1 = t2.f1
WHERE t2.f1 IS NOT NULL AND (t1.f5 != t2.f3)
Runs forever until I stop it (at least 120 minutes!!).
Tables t1 and t2 each hold about 500k records.
I always assumed the TOP clause did not matter if the total number of rows lay below that number; however, there seems to be a very significant difference. Is this normal (and if so, why), or is this just a fluke?
EDIT:
As requested:
t1:
CREATE TABLE [dbo].[t1](
[f1] [int] NOT NULL,
[f2] [varchar](10) NULL,
[f3] [varchar](4) NULL,
[f4] [int] NOT NULL,
[f5] [varchar](max) NULL,
CONSTRAINT [PK_t1] PRIMARY KEY CLUSTERED
(
[f1] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]
t2:
CREATE TABLE [dbo].[t2](
[f1] [nchar](10) NOT NULL,
[f2] [nchar](10) NOT NULL,
[f3] [varchar](max) NOT NULL,
[f4] [nchar](10) NULL,
[f5] [date] NULL,
[f6] [date] NULL,
[f7] [nchar](1) NULL,
CONSTRAINT [PK_t2] PRIMARY KEY CLUSTERED
(
[f1] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]
Execution plans:
With top:
Without top:
Looking at this, I would have to conclude that the sorting (WHY is it doing that??) causes the delay... Would you agree?
Edit 2: As requested, the execution plan with the loops option, without TOP:
The problem is that your two tables, [t1] and [t2], have completely different (and largely incompatible) data types for the JOIN column, f1.
This makes it impossible for the Query Optimizer to generate an accurate estimate of how many rows will match between these two 500,000-row tables. It appears to be using a default "guess" that in this case is a gross over-estimate of the actual number (3447). Because of this, when you are not using TOP, it thinks it will be more efficient to Sort and then Merge the rows (O(N log N)) than to do Nested Loops (O(N^2)), because it does not realize that the (merge) JOIN will actually eliminate almost all of the rows.
When you have the TOP 5000 on, it realizes that Nested Loops are better, because the output will be cut off at no more than 5000 rows (far less than 500k^2, and even less than 500k * log(500k)). But unlike Nested Loops, the Merge-Sort cannot be done incrementally; it has to have all of the rows before the Sort can run. So cutting off the output at 5000 would not save much at all, which makes Nested Loops clearly the better option (even with the bad JOIN estimate).
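You can see the misestimate for yourself by comparing estimated and actual row counts. A minimal way to do that (the exact numbers will depend on your data and SQL Server version) is:
SET STATISTICS PROFILE ON;
SELECT t1.f1, t1.f2, t1.f3
FROM t1
JOIN t2 ON t1.f1 = t2.f1
WHERE t2.f1 IS NOT NULL AND (t1.f5 != t2.f3);
SET STATISTICS PROFILE OFF;
-- In the plan output, compare the Rows column (actual) to EstimateRows for the join operator.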
The root problem here is that the column t2.f1 is an NCHAR(10), which is a really bad choice for something that apparently is supposed to contain an integer. The best solution would be to change that column's data type to INT.
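A minimal sketch of that change, assuming every existing value in t2.f1 converts cleanly to an integer (the clustered primary key has to be dropped and re-created around the ALTER):
-- drop the PK that covers f1, change the type, then put the PK back
ALTER TABLE dbo.t2 DROP CONSTRAINT PK_t2;
ALTER TABLE dbo.t2 ALTER COLUMN f1 int NOT NULL;
ALTER TABLE dbo.t2 ADD CONSTRAINT PK_t2 PRIMARY KEY CLUSTERED (f1);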
If for some reason you cannot do that, then depending on your version of SQL Server, you may be able to end run this by adding a persisted computed column that calculates an INT converted value of [f1] and then throw a compatible index on that. This would allow both indexing and statistics to work again for queries like this.
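A sketch of that approach, assuming SQL Server 2012 or later for TRY_CONVERT (the column and index names here are only illustrative):
ALTER TABLE dbo.t2 ADD f1_int AS TRY_CONVERT(int, f1) PERSISTED;
CREATE INDEX IX_t2_f1_int ON dbo.t2 (f1_int);
-- The query can then join on matching types:
SELECT t1.f1, t1.f2, t1.f3
FROM t1
JOIN t2 ON t1.f1 = t2.f1_int
WHERE t2.f1_int IS NOT NULL AND (t1.f5 != t2.f3);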
As a last resort, you could also use a query hint. I do not normally recommend them, because they tend to be stopgap solutions that cause problems later on. However, if you felt this was your only choice, then adding OPTION (FAST 1000) to the end of your query would probably work.
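For example, applied to the query from the question (the hint just tells the optimizer to plan for the first 1000 rows quickly, much as TOP does):
SELECT t1.f1, t1.f2, t1.f3
FROM t1
JOIN t2 ON t1.f1 = t2.f1
WHERE t2.f1 IS NOT NULL AND (t1.f5 != t2.f3)
OPTION (FAST 1000);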
SQL queries can be optimized in many different ways. Two common goals are "fastest first row" and "fastest last row": that is, do you want to minimize the time to get any result, or the time to get the complete result set?
I would guess that these two versions are being optimized differently. You can check this, as Aaron suggests, by looking at the execution plans. My usual bet is that the slow version uses nested loop joins. You can fix this with an optimizer hint, such as:
<your query>
OPTION (MERGE JOIN, HASH JOIN)
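Applied to the query from the question, that would look like the following (whether the optimizer then picks a merge or hash join depends on the data):
SELECT t1.f1, t1.f2, t1.f3
FROM t1
JOIN t2 ON t1.f1 = t2.f1
WHERE t2.f1 IS NOT NULL AND (t1.f5 != t2.f3)
OPTION (MERGE JOIN, HASH JOIN);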
There are other possibilities. Perhaps these tables are being updated, and they happened to have full table locks when you ran the second query. You can check this using sp_who2.
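For example (the BlkBy column in the output shows which session, if any, is blocking another):
EXEC sp_who2;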