In a previous job we had to compare item x with item x-1 for a lot of data (~a billion rows). As this was done on SQL Server 2008 R2 we had to use a self-join. It was slow.
I thought I'd experiment with the lag function; this would be hugely valuable if it were fast. I found it only ~2 to 3 times faster, which is very disappointing, because it should be a simple operation under the hood and its query plan/table scanning is vastly simpler. Code to reproduce below.
Create DB:
IF EXISTS (SELECT name
FROM sys.databases
WHERE name = N'TestDBForLag')
DROP DATABASE TestDBForLag
GO
create database TestDBForLag
ALTER DATABASE TestDBForLag SET RECOVERY SIMPLE
go
use TestDBForLag
go
set nocount on
create table big (g binary(16) not null)
go
begin transaction
declare @c int = 0
while @c < 100
begin
insert into big(g) values(cast(newid() as binary(16)))
set @c += 1
end
commit
go 10000 -- n repeats of last batch, "big" now has 1,000,000 rows
alter table big
add constraint clustered_PK primary key clustered (g)
Queries:
set statistics time on
set statistics io on
-- new style
select
g, lag(g, 1) over (order by g) as xx
from big
order by g
-- old style
select obig.g,
(
select max(g)
from big as ibig
where ibig.g < obig.g
) as xx
from big as obig
order by g
You can look at the actual/estimated query plans yourself, but here are the results of the stats (each query run twice to discount compilation time):
(1000000 row(s) affected)
Table 'Worktable'. {edit: everything zero here}.
**Table 'big'. Scan count 1, logical reads 3109**, {edit: everything else is zero here}.
SQL Server Execution Times: CPU time = 1045 ms, elapsed time = 3516 ms.
---
(1000000 row(s) affected)
**Table 'big'. Scan count 1000001, logical reads 3190609**, {edit: everything else is zero here}.
SQL Server Execution Times: CPU time = 2683 ms, elapsed time = 3439 ms.
So, lag takes 1 scan + 3,109 reads and ~1 sec of CPU time, while the complex self-join, which has to repeatedly walk the b-tree, takes 1 million scans + 3.2 million reads and ~2.7 secs of CPU time.
I don't see any reason for this rotten performance. Any ideas?
Running on a ThinkServer 140 with 8 GB RAM (so the data is entirely memory-resident), dual core, no disk contention. I'm satisfied that the time to transfer result sets to the client, which is running on the same machine, is negligible.
select @@version
returns:
Microsoft SQL Server 2014 - 12.0.4213.0 (X64) Developer Edition (64-bit)
on Windows NT 6.1 <X64> (Build 7601: Service Pack 1)
Edit:
Per @vnov's comment, I carefully discounted client overhead before I posted; I'm looking at CPU time, not overall time. Test:
select *
from big
Table 'big'. Scan count 1, logical reads 3109, {rest zero}
SQL Server Execution Times: CPU time = 125 ms, elapsed time = 2840 ms.
select count(*)
from big
Table 'big'. Scan count 1, logical reads 3109, {rest zero}
SQL Server Execution Times: CPU time = 109 ms, elapsed time = 129 ms.
lag just should not add anything significant AFAICS, never mind an order of magnitude.
Edit2:
@Frisbee did not see why I thought lag was poor. Basically the algorithm is to remember a previous value and deliver it n rows later. If n = 1 that's even more trivial so I did some code using cursors, with and without the homemade lag, and measured. I also trivially summarised the results so it wasn't returning huge result sets, per vnov's point. Both cursor & selects gave the same results of sumg = 127539666, sumglag = 127539460. Code uses same DB + table as created above.
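The remember-one-value algorithm can be sketched outside SQL. Here's an illustrative Python version (my own sketch of the idea, not how SQL Server implements LAG) showing that a single pass with a buffer of n values is all it should take:

```python
from collections import deque

def lag_pairs(rows, n=1, default=None):
    """One pass over rows, buffering only the last n values:
    yields (value, value n rows earlier)."""
    buf = deque([default] * n, maxlen=n + 1)
    for v in rows:
        buf.append(v)          # oldest entry falls out once full
        yield v, buf[0]        # buf[0] is the value n rows back

list(lag_pairs([10, 20, 30]))  # [(10, None), (20, 10), (30, 20)]
```

With n = 1 the buffer degenerates to a single "previous value" variable, exactly as in the cursor code below.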
The select version:
select
sum(cast(g as tinyint)) as sumg
from (
select g
from big
) as xx
select
sum(cast(g as tinyint)) as sumg,
sum(cast(glag as tinyint)) as sumglag
from (
select g, lag(g, 1) over (order by g) as glag
from big
) as xx
I didn't do a bulk measurement but by observation the plain select vs lag here was fairly consistently ~360-400ms vs ~1700-1900ms, so 4 or 5 times slower.
For the cursors, top one emulates first select, bottom one emulates select with lag:
---------- nonlagging batch --------------
use TestDBForLag
set nocount on
DECLARE crsr CURSOR FAST_FORWARD READ_ONLY FOR
select g from big order by g
DECLARE @g binary(16), @sumg int = 0
OPEN crsr
FETCH NEXT FROM crsr INTO @g
WHILE (@@fetch_status <> -1)
BEGIN
IF (@@fetch_status <> -2)
BEGIN
set @sumg += cast(@g as tinyint)
END
FETCH NEXT FROM crsr INTO @g
END
CLOSE crsr
DEALLOCATE crsr
select @sumg as sumg
go 300
---------- lagging batch --------------
use TestDBForLag
set nocount on
DECLARE crsr CURSOR FAST_FORWARD READ_ONLY FOR
select g from big order by g
DECLARE @g binary(16), @sumg int = 0
DECLARE @glag binary(16) = 0, @sumglag int = 0
OPEN crsr
FETCH NEXT FROM crsr INTO @g
WHILE (@@fetch_status <> -1)
BEGIN
IF (@@fetch_status <> -2)
BEGIN
set @sumg += cast(@g as tinyint)
set @sumglag += cast(@glag as tinyint) -- the only ...
set @glag = @g -- ... differences
END
FETCH NEXT FROM crsr INTO @g
END
CLOSE crsr
DEALLOCATE crsr
select @sumg as sumg, @sumglag as sumglag
go 300
Run the above with SQL Profiler attached (remove the SQL:BatchStarting event); it takes ~2.5 hours for me. Save the trace as a table called 'trace', then run this to get the average duration:
-- the trace saves duration in microseconds;
-- divide by 1000 to get back to milliseconds
select
cast(textdata as varchar(8000)) as textdata,
avg(duration/1000) as avg_duration_ms
from trace
group by cast(textdata as varchar(8000))
For me the nonlagging cursor takes an average of 13.65 secs and the cursor emulating lag takes 16.04 secs. Most of the extra time of the latter will come from the overhead of the interpreter dealing with the extra statements (I'd expect it to be far less if implemented in C), but in any event that's less than 20% extra to calculate lag.
So, does this explanation sound reasonable, and can anyone suggest why lag is so poorly performing in a select statement?
LAG is an analytic (window) function: it provides access to a row at a specified physical offset before the current row, without a self-join. In other words, from the current row you can access data of the previous row, or the row before that, and so on.
Examine the execution plans of both variants and you'll see what is going on. I use the free version of SQL Sentry Plan Explorer for this.
I'm comparing these three queries (plus one more with OUTER APPLY):
select count(*)
from big;
-- new style
select
g, lag(g) over (order by g) as xx
from big
order by g;
-- old style
select obig.g,
(
select max(g)
from big as ibig
where ibig.g < obig.g
) as xx
from big as obig
order by g;
1) LAG is implemented using a Window Spool, which emits twice the number of rows (1,999,999) from a temporary Worktable (it is in memory in this case, but still). The Window Spool doesn't cache all 1,000,000 rows in the Worktable; it caches only the window size.
The Window Spool operator expands each row into the set of rows that represents the window associated with it.
There are many other less heavy operators in the plan as well. The point here is that LAG is not implemented the way you do it in your cursor test.
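To make that cost concrete, here's a toy Python model (my own illustration, not SQL Server internals) of a Window Spool for a one-row lag: each output row replays its frame, the prior row plus the current one, which is where 1,999,999 spooled rows for 1,000,000 input rows come from:

```python
def window_spool_lag(rows):
    """Toy model: for each input row, the spool replays the frame
    (previous row, if any, plus the current row), and the lag value
    is read from that frame. Counts the rows the spool hands out."""
    out, spooled, prev = [], 0, None
    for v in rows:
        frame = ([] if prev is None else [prev]) + [v]
        spooled += len(frame)   # 1 for the first row, 2 thereafter
        out.append((v, frame[0] if len(frame) == 2 else None))
        prev = v
    return out, spooled

window_spool_lag(range(1_000_000))[1]  # 1_999_999
```

So even though the answers are correct, roughly 2N rows flow through the spool instead of N, and that extra row traffic is pure CPU when everything is in memory.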
2) The plan for the old-style query is pretty good. The optimizer is smart enough to scan the table once and do an index seek with TOP for each row to calculate MAX. Yes, it is a million seeks, but everything is in memory, so it is relatively fast.
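For comparison, the old-style plan's per-row work can be modeled the same way (again an illustrative sketch, with a binary search standing in for the b-tree seek): one "seek" per row, i.e. N seeks of O(log N) each, which matches the million scans in the stats:

```python
import bisect

def old_style_lag(sorted_keys):
    """Model of the correlated subquery: for each row do one b-tree
    'seek' (here a binary search) for the greatest key < g."""
    out, seeks = [], 0
    for g in sorted_keys:
        i = bisect.bisect_left(sorted_keys, g)  # one seek per row
        seeks += 1
        out.append((g, sorted_keys[i - 1] if i > 0 else None))
    return out, seeks

old_style_lag([1, 3, 5])  # ([(1, None), (3, 1), (5, 3)], 3)
```

The logical reads balloon (every seek touches a few pages), but each individual seek is cheap against memory-resident pages.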
3) Hover over thick arrows between plan operators and you'll see the actual data size. It is twice as big for Window Spool. So, when everything is in memory and CPU bound, it becomes important.
4) Your old style query could be rewritten as:
select obig.g, A.g
from big as obig
OUTER APPLY
(
SELECT TOP(1) ibig.g
FROM big as ibig
WHERE ibig.g < obig.g
ORDER BY ibig.g DESC
) AS A
order by obig.g;
This is a bit more efficient (see the CPU column in the screenshot).
So, LAG is very efficient in the number of pages read, but uses quite a lot of CPU.