In a previous job we had to compare item x with item x-1 for a lot of data (~a billion rows). As this was done on SQL Server 2008 R2 we had to use a self-join. It was slow.
I thought I'd experiment with the lag function; this would be hugely valuable if it were fast. I found it only ~2 to 3 times faster, which is very disappointing, because it should be a simple operation under the hood and its query plan/table scanning is vastly simpler. Code to reproduce below.
Create DB:
IF EXISTS (SELECT name
FROM sys.databases
WHERE name = N'TestDBForLag')
DROP DATABASE TestDBForLag
GO
create database TestDBForLag
ALTER DATABASE TestDBForLag SET RECOVERY SIMPLE
go
use TestDBForLag
go
set nocount on
create table big (g binary(16) not null)
go
begin transaction
declare @c int = 0
while @c < 100
begin
insert into big(g) values(cast(newid() as binary(16)))
set @c += 1
end
commit
go 10000 -- n repeats of last batch, "big" now has 1,000,000 rows
alter table big
add constraint clustered_PK primary key clustered (g)
Queries:
set statistics time on
set statistics io on
-- new style
select
g, lag(g, 1) over (order by g) as xx
from big
order by g
-- old style
select obig.g,
(
select max(g)
from big as ibig
where ibig.g < obig.g
) as xx
from big as obig
order by g
You can look at the actual/estimated query plans yourself, but here are the results of the stats (each query run twice to discount compilation time):
(1000000 row(s) affected)
Table 'Worktable'. {edit: everything zero here}.
**Table 'big'. Scan count 1, logical reads 3109**, {edit: everything else is zero here}.
SQL Server Execution Times: CPU time = 1045 ms, elapsed time = 3516 ms.
---
(1000000 row(s) affected)
**Table 'big'. Scan count 1000001, logical reads 3190609**, {edit: everything else is zero here}.
SQL Server Execution Times: CPU time = 2683 ms, elapsed time = 3439 ms.
So, lag takes 1 scan + 3,109 reads and ~1 sec of CPU time, while the complex self-join, which has to repeatedly walk the b-tree, takes 1 million scans + 3.2 million reads and ~2.7 secs of CPU time.
I don't see any reason for this rotten performance. Any ideas?
Running on a ThinkServer 140 with 8 GB RAM (so the data is entirely memory-resident), dual core, no disk contention. I'm satisfied that the time to transfer result sets to the client, which is running on the same machine, is negligible.
select @@version
returns:
Microsoft SQL Server 2014 - 12.0.4213.0 (X64) Developer Edition (64-bit)
on Windows NT 6.1 <X64> (Build 7601: Service Pack 1)
Edit:
Per @vnov's comment, I carefully discounted client overhead before I posted; I'm looking at CPU time, not overall time. Test:
select *
from big
Table 'big'. Scan count 1, logical reads 3109, {rest zero}
SQL Server Execution Times: CPU time = 125 ms, elapsed time = 2840 ms.
select count(*)
from big
Table 'big'. Scan count 1, logical reads 3109, {rest zero}
SQL Server Execution Times: CPU time = 109 ms, elapsed time = 129 ms.
lag just should not add anything significant AFAICS, never mind an order of magnitude.
Edit2:
@Frisbee did not see why I thought lag was poor. Basically the algorithm is to remember a previous value and deliver it n rows later. If n = 1 that's even more trivial so I did some code using cursors, with and without the homemade lag, and measured. I also trivially summarised the results so it wasn't returning huge result sets, per vnov's point. Both cursor & selects gave the same results of sumg = 127539666, sumglag = 127539460. Code uses same DB + table as created above.
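The remember-one-value algorithm can be sketched outside SQL. Here's an illustrative Python version (my own sketch of the idea, not how SQL Server implements LAG) showing that a single pass with a buffer of n values is all it should take:

```python
from collections import deque

def lag_pairs(rows, n=1, default=None):
    """One pass over rows, buffering only the last n values:
    yields (value, value n rows earlier)."""
    buf = deque([default] * n, maxlen=n + 1)
    for v in rows:
        buf.append(v)          # oldest entry falls out once full
        yield v, buf[0]        # buf[0] is the value n rows back

list(lag_pairs([10, 20, 30]))  # [(10, None), (20, 10), (30, 20)]
```

With n = 1 the buffer degenerates to a single "previous value" variable, exactly as in the cursor code below.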
The select version:
select
sum(cast(g as tinyint)) as sumg
from (
select g
from big
) as xx
select
sum(cast(g as tinyint)) as sumg,
sum(cast(glag as tinyint)) as sumglag
from (
select g, lag(g, 1) over (order by g) as glag
from big
) as xx
I didn't do a bulk measurement but by observation the plain select vs lag here was fairly consistently ~360-400ms vs ~1700-1900ms, so 4 or 5 times slower.
For the cursors, top one emulates first select, bottom one emulates select with lag:
---------- nonlagging batch --------------
use TestDBForLag
set nocount on
DECLARE crsr CURSOR FAST_FORWARD READ_ONLY FOR
select g from big order by g
DECLARE @g binary(16), @sumg int = 0
OPEN crsr
FETCH NEXT FROM crsr INTO @g
WHILE (@@fetch_status <> -1)
BEGIN
IF (@@fetch_status <> -2)
BEGIN
set @sumg += cast(@g as tinyint)
END
FETCH NEXT FROM crsr INTO @g
END
CLOSE crsr
DEALLOCATE crsr
select @sumg as sumg
go 300
---------- lagging batch --------------
use TestDBForLag
set nocount on
DECLARE crsr CURSOR FAST_FORWARD READ_ONLY FOR
select g from big order by g
DECLARE @g binary(16), @sumg int = 0
DECLARE @glag binary(16) = 0, @sumglag int = 0
OPEN crsr
FETCH NEXT FROM crsr INTO @g
WHILE (@@fetch_status <> -1)
BEGIN
IF (@@fetch_status <> -2)
BEGIN
set @sumg += cast(@g as tinyint)
set @sumglag += cast(@glag as tinyint) -- the only ...
set @glag = @g -- ... differences
END
FETCH NEXT FROM crsr INTO @g
END
CLOSE crsr
DEALLOCATE crsr
select @sumg as sumg, @sumglag as sumglag
go 300
Run the above with SQL Profiler attached (remove the SQL:BatchStarting event); it takes ~2.5 hours for me. Save the trace as a table called 'trace', then run this to get the average duration:
-- the trace saves duration in microseconds;
-- divide by 1000 to get back to milliseconds
select
cast(textdata as varchar(8000)) as textdata,
avg(duration/1000) as avg_duration_ms
from trace
group by cast(textdata as varchar(8000))
For me the nonlagging cursor takes an average of 13.65 secs and the cursor emulating lag takes 16.04 secs. Most of the extra time of the latter will come from the overhead of the interpreter dealing with the extra statements (I'd expect it to be far less if implemented in C), but in any event that's less than 20% extra to calculate lag.
So, does this explanation sound reasonable, and can anyone suggest why lag is so poorly performing in a select statement?
LAG is an analytic (window) function: it provides access to a row at a specified physical offset before the current row, without a self-join. In other words, from the current row you can access data of the previous row, or the row before that, and so on.
Examine the execution plans of both variants and you'll see what is going on. I use the free version of SQL Sentry Plan Explorer for this.
I'm comparing these three queries (plus one more with OUTER APPLY):
select count(*)
from big;
-- new style
select
g, lag(g) over (order by g) as xx
from big
order by g;
-- old style
select obig.g,
(
select max(g)
from big as ibig
where ibig.g < obig.g
) as xx
from big as obig
order by g;
1) LAG is implemented using a Window Spool, which emits twice the number of rows (1,999,999) from a temporary Worktable (it is in memory in this case, but still). The Window Spool doesn't cache all 1,000,000 rows in the Worktable; it caches only the window size.
The Window Spool operator expands each row into the set of rows that represents the window associated with it.
There are many other less heavy operators in the plan as well. The point here is that LAG is not implemented the way you do it in your cursor test.
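To make that cost concrete, here's a toy Python model (my own illustration, not SQL Server internals) of a Window Spool for a one-row lag: each output row replays its frame, the prior row plus the current one, which is where 1,999,999 spooled rows for 1,000,000 input rows come from:

```python
def window_spool_lag(rows):
    """Toy model: for each input row, the spool replays the frame
    (previous row, if any, plus the current row), and the lag value
    is read from that frame. Counts the rows the spool hands out."""
    out, spooled, prev = [], 0, None
    for v in rows:
        frame = ([] if prev is None else [prev]) + [v]
        spooled += len(frame)   # 1 for the first row, 2 thereafter
        out.append((v, frame[0] if len(frame) == 2 else None))
        prev = v
    return out, spooled

window_spool_lag(range(1_000_000))[1]  # 1_999_999
```

So even though the answers are correct, roughly 2N rows flow through the spool instead of N, and that extra row traffic is pure CPU when everything is in memory.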
2) The plan for the old-style query is pretty good. The optimizer is smart enough to scan the table once and do an index seek with TOP for each row to calculate MAX. Yes, it is a million seeks, but everything is in memory, so it is relatively fast.
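For comparison, the old-style plan's per-row work can be modeled the same way (again an illustrative sketch, with a binary search standing in for the b-tree seek): one "seek" per row, i.e. N seeks of O(log N) each, which matches the million scans in the stats:

```python
import bisect

def old_style_lag(sorted_keys):
    """Model of the correlated subquery: for each row do one b-tree
    'seek' (here a binary search) for the greatest key < g."""
    out, seeks = [], 0
    for g in sorted_keys:
        i = bisect.bisect_left(sorted_keys, g)  # one seek per row
        seeks += 1
        out.append((g, sorted_keys[i - 1] if i > 0 else None))
    return out, seeks

old_style_lag([1, 3, 5])  # ([(1, None), (3, 1), (5, 3)], 3)
```

The logical reads balloon (every seek touches a few pages), but each individual seek is cheap against memory-resident pages.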
3) Hover over thick arrows between plan operators and you'll see the actual data size. It is twice as big for Window Spool. So, when everything is in memory and CPU bound, it becomes important.
4) Your old style query could be rewritten as:
select obig.g, A.g
from big as obig
OUTER APPLY
(
SELECT TOP(1) ibig.g
FROM big as ibig
WHERE ibig.g < obig.g
ORDER BY ibig.g DESC
) AS A
order by obig.g;
This is a bit more efficient (see the CPU column in the screenshot).
So, LAG is very efficient in the number of pages read, but uses quite a lot of CPU.