
Refactoring a tsql view which uses row_number() to return rows with a unique column value

I have a SQL view which I'm using to retrieve data. Let's say it's a large list of products, which are linked to the customers who have bought them. The view should return only one row per product, no matter how many customers it is linked to. I'm using the row_number function to achieve this. (This example is simplified; the general situation is a query that should return only one row for each unique value of some column X. Which row is returned is not important.)

CREATE VIEW productView AS
SELECT * FROM 
    (SELECT 
        Row_number() OVER(PARTITION BY products.Id ORDER BY products.Id) AS product_numbering,
        customer.Id
        --various other columns
    FROM products
    LEFT OUTER JOIN customer ON customer.productId = products.Id
    --various other joins
    ) as temp
WHERE temp.product_numbering = 1

Now let's say that the total number of rows in this view is ~1 million, and running select * from productView takes 10 seconds. A query such as select * from productView where productID = 10 takes the same amount of time. I believe this is because the query gets evaluated as:

SELECT * FROM 
    (SELECT 
        Row_number() OVER(PARTITION BY products.Id ORDER BY products.Id) AS product_numbering,
        customer.Id
        --various other columns
    FROM products
    LEFT OUTER JOIN customer ON customer.productId = products.Id
    --various other joins
    ) as temp
WHERE product_numbering = 1 and products.Id = 10

I think this is causing the inner subquery to be evaluated in full each time. Ideally I'd like to use something along the following lines

SELECT 
    Row_number() OVER(PARTITION BY products.productID ORDER BY products.productID) AS product_numbering,
    customer.id
    --various other columns
FROM products
    LEFT OUTER JOIN customer ON customer.productId = products.Id
    --various other joins
WHERE product_numbering = 1

But this doesn't seem to be allowed. Is there any way to do something similar?

EDIT -

After much experimentation, I believe the actual problem is how to force a join to return exactly one row. I tried outer apply, as suggested below. Some sample code:

CREATE TABLE Products (id int not null PRIMARY KEY)
CREATE TABLE Customers (
        id int not null PRIMARY KEY,
        productId int not null,
        value varchar(20) NOT NULL)

declare @count int = 1
while @count <= 150000
begin
        insert into Customers (id, productID, value)
        values (@count,@count/2, 'Value ' + cast(@count/2 as varchar))      
        insert into Products (id) 
        values (@count)
        SET @count = @count + 1
end

CREATE NONCLUSTERED INDEX productId ON Customers (productID ASC)

With the above sample set, the 'get everything' query below

select * from Products
outer apply (select top 1 * 
            from Customers
            where Products.id = Customers.productID) Customers

takes ~1000ms to run. Adding an explicit condition:

select * from Products
outer apply (select top 1 * 
            from Customers
            where Products.id = Customers.productID) Customers
where Customers.value = 'Value 45872'

takes about the same amount of time. 1000ms for a fairly simple query is already too much, and it gets worse as more joins of this kind are added.

asked Oct 18 '11 11:10 by John



2 Answers

Try the following approach, using a Common Table Expression (CTE). With the test data you provided, it returns specific ProductIds in less than a second.

create view ProductTest as 

with cte as (
select 
    row_number() over (partition by p.id order by p.id) as RN, 
    c.*
from 
    Products p
    inner join Customers c
        on  p.id = c.productid
)

select * 
from cte
where RN = 1
go

select * from ProductTest where ProductId = 25
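
Two things to watch with the view above: the inner join drops products that have no customers, and ORDER BY p.id leaves the choice of row within each partition arbitrary. Here is a minimal sketch of a variant that keeps every product and picks the lowest customer id, assuming the sample Products and Customers tables from the question. The view name ProductTestAll and the column aliases are made up for illustration.

```sql
-- Hypothetical variant; ProductTestAll, ProductId and CustomerId are illustrative names.
create view ProductTestAll as
with cte as (
    select
        -- order by c.id makes the choice of row within each product repeatable
        row_number() over (partition by p.id order by c.id) as RN,
        p.id    as ProductId,
        c.id    as CustomerId,
        c.value as value
    from
        Products p
        -- left join keeps products with no customers (their customer columns are NULL)
        left join Customers c
            on p.id = c.productId
)
select ProductId, CustomerId, value
from cte
where RN = 1
go

-- Report I/O and timing so the filtered and unfiltered queries can be compared
set statistics io on
set statistics time on
go

select * from ProductTestAll where ProductId = 25
```

The filter is on the same column the row numbers are partitioned by, which is what gives the optimizer a chance to apply it before numbering the rows. Whether it actually does so depends on the SQL Server version and the shape of the predicate, so it is worth checking the execution plan rather than assuming it.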
answered Sep 27 '22 16:09 by Derek Kromm


What if you did something like:

SELECT ...
FROM products
OUTER APPLY (SELECT TOP 1 * from customer where customerid = products.buyerid) as customer
...

Then the filter on productId should help. It might be worse without filtering, though.
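
As a concrete sketch against the sample tables from the question's edit (the aliases p, cu and c and the output column names are just for illustration), the same idea with the filter placed on Products.id rather than on a Customers column might look like this:

```sql
-- Sketch only; assumes the Products and Customers tables and the
-- productId index created in the question's sample script.
select
    p.id     as productId,
    c.id     as customerId,
    c.value
from Products p
outer apply (
    select top 1 cu.id, cu.value
    from Customers cu
    where cu.productId = p.id
    order by cu.id          -- makes "which row" repeatable instead of arbitrary
) c
where p.id = 45872          -- filter on the outer table's primary key
```

With the filter on the primary key of Products, the outer side can be reduced to a single row before the apply runs, so the inner query should only need one lookup through the productId index. The question's filter on Customers.value is different: value has no index and is only known after the apply has run for every product, which is a likely reason that query costs as much as the unfiltered one.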

answered Sep 27 '22 18:09 by GilM