Imagine I have a denormalized table like so:
CREATE TABLE Persons
(
Id int identity primary key,
FirstName nvarchar(100),
CountryName nvarchar(100)
)
INSERT INTO Persons
VALUES ('Mark', 'Germany'),
('Chris', 'France'),
('Grace', 'Italy'),
('Antonio', 'Italy'),
('Francis', 'France'),
('Amanda', 'Italy');
I need to construct a query that returns the name of each person, and a unique ID for their country. The IDs do not necessarily have to be contiguous; more importantly, they do not have to be in any order. What is the most efficient way of achieving this?
The simplest solution appears to be DENSE_RANK:
SELECT FirstName,
CountryName,
DENSE_RANK() OVER (ORDER BY CountryName) AS CountryId
FROM Persons
-- FirstName CountryName CountryId
-- Chris France 1
-- Francis France 1
-- Mark Germany 2
-- Amanda Italy 3
-- Grace Italy 3
-- Antonio Italy 3
However, this incurs a sort on my CountryName column, which is a wasteful performance hog. I came up with this alternative, which uses ROW_NUMBER with the well-known trick for suppressing its sort:
SELECT P.FirstName,
P.CountryName,
C.CountryId
FROM Persons P
JOIN (
SELECT CountryName,
ROW_NUMBER() OVER (ORDER BY (SELECT 1)) AS CountryId
FROM Persons
GROUP BY CountryName
) C
ON C.CountryName = P.CountryName
-- FirstName CountryName CountryId
-- Mark Germany 2
-- Chris France 1
-- Grace Italy 3
-- Antonio Italy 3
-- Francis France 1
-- Amanda Italy 3
Am I correct in assuming that the second query would perform better in general (not just on my contrived data set)? Are there factors that might make a difference either way (such as an index on CountryName)? Is there a more elegant way of expressing it?
Why would you think that an aggregation would be cheaper than a window function? I ask, because I have some experience with both, and don't have a strong opinion on the matter. If pressed, I would guess the window function is faster, because it does not have to aggregate all the data and then join the result back in.
The two queries will have very different execution paths. The right way to see which performs better is to try it out. Run both queries on large enough samples of data in your environment.
By the way, I don't think there is a right answer, because performance depends on several factors:
If you are concerned about performance, and just want a unique number, you could consider using checksum() instead. This does run the risk of collisions. That risk is very, very small for 200 or so countries. Plus you can test for it and do something about it if it does occur. The query would be:
SELECT FirstName, CountryName, CheckSum(CountryName) AS CountryId
FROM Persons;
Your second query would most probably avoid sorting as it would use a hash match aggregate to build the inner query, then use a hash match join to map the ID to the actual records.
This does not sort indeed, but has to scan the original table twice.
Am I correct in assuming that the second query would perform better in general (not just on my contrived data set)?
Not necessarily. If you created a clustered index on CountryName, sorting would be a non-issue and everything would be done in a single pass.
Is there a more elegant way of expressing it?
A "correct" plan would be doing the hashing and hash lookups in one go.
Each record, as it's read, would have to be matched against the hash table. On a match, the stored ID would be returned; on a miss, the new country would be added into the hash table, assigned with new ID and that newly assigned ID would be returned.
But I can't think of a way to make SQL Server use such a plan in a single query.
Update:
If you have lots of records, few countries and, most importantly, a non-clustered index on CountryName, you could emulate loose scan to build a list of countries:
DECLARE @country TABLE
(
id INT NOT NULL IDENTITY PRIMARY KEY,
countryName VARCHAR(MAX)
)
;
WITH country AS
(
SELECT TOP 1
countryName
FROM persons
ORDER BY
countryName
UNION ALL
SELECT (
SELECT countryName
FROM (
SELECT countryName,
ROW_NUMBER() OVER (ORDER BY countryName) rn
FROM persons
WHERE countryName > country.countryName
) q
WHERE rn = 1
)
FROM country
WHERE countryName IS NOT NULL
)
INSERT
INTO @country (countryName)
SELECT countryName
FROM country
WHERE countryName IS NOT NULL
OPTION (MAXRECURSION 0)
SELECT p.firstName, c.id
FROM persons p
JOIN @country c
ON c.countryName = p.countryName
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With