Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Efficient way of getting group ID without sorting

Imagine I have a denormalized table like so:

CREATE TABLE Persons
(
    Id           int identity primary key,
    FirstName    nvarchar(100),
    CountryName  nvarchar(100)
)

INSERT INTO Persons
VALUES ('Mark',    'Germany'),
       ('Chris',   'France'),
       ('Grace',   'Italy'),
       ('Antonio', 'Italy'),
       ('Francis', 'France'),
       ('Amanda',  'Italy');

I need to construct a query that returns the name of each person, and a unique ID for their country. The IDs do not necessarily have to be contiguous; more importantly, they do not have to be in any order. What is the most efficient way of achieving this?

The simplest solution appears to be DENSE_RANK:

SELECT FirstName, 
       CountryName, 
       DENSE_RANK() OVER (ORDER BY CountryName) AS CountryId
FROM Persons

-- FirstName  CountryName  CountryId
-- Chris      France       1
-- Francis    France       1
-- Mark       Germany      2
-- Amanda     Italy        3
-- Grace      Italy        3
-- Antonio    Italy        3

However, this incurs a sort on my CountryName column, which is a wasteful performance hog. I came up with this alternative, which uses ROW_NUMBER with the well-known trick for suppressing its sort:

SELECT P.FirstName, 
       P.CountryName,
       C.CountryId
FROM Persons P
    JOIN (
        SELECT CountryName, 
               ROW_NUMBER() OVER (ORDER BY (SELECT 1)) AS CountryId
        FROM Persons
        GROUP BY CountryName
    ) C
    ON C.CountryName = P.CountryName

-- FirstName  CountryName  CountryId
-- Mark       Germany      2
-- Chris      France       1
-- Grace      Italy        3
-- Antonio    Italy        3
-- Francis    France       1
-- Amanda     Italy        3

Am I correct in assuming that the second query would perform better in general (not just on my contrived data set)? Are there factors that might make a difference either way (such as an index on CountryName)? Is there a more elegant way of expressing it?

like image 952
Douglas Avatar asked Dec 12 '25 19:12

Douglas


2 Answers

Why would you think that an aggregation would be cheaper than a window function? I ask, because I have some experience with both, and don't have a strong opinion on the matter. If pressed, I would guess the window function is faster, because it does not have to aggregate all the data and then join the result back in.

The two queries will have very different execution paths. The right way to see which performs better is to try it out. Run both queries on large enough samples of data in your environment.

By the way, I don't think there is a right answer, because performance depends on several factors:

  • Which columns are indexed?
  • How large is the data? Does it fit in memory?
  • How many different countries are there?

If you are concerned about performance, and just want a unique number, you could consider using checksum() instead. This does run the risk of collisions. That risk is very, very small for 200 or so countries. Plus you can test for it and do something about it if it does occur. The query would be:

SELECT FirstName, CountryName, CheckSum(CountryName) AS CountryId
FROM Persons;
like image 136
Gordon Linoff Avatar answered Dec 14 '25 07:12

Gordon Linoff


Your second query would most probably avoid sorting as it would use a hash match aggregate to build the inner query, then use a hash match join to map the ID to the actual records.

This does not sort indeed, but has to scan the original table twice.

Am I correct in assuming that the second query would perform better in general (not just on my contrived data set)?

Not necessarily. If you created a clustered index on CountryName, sorting would be a non-issue and everything would be done in a single pass.

Is there a more elegant way of expressing it?

A "correct" plan would be doing the hashing and hash lookups in one go.

Each record, as it's read, would have to be matched against the hash table. On a match, the stored ID would be returned; on a miss, the new country would be added into the hash table, assigned with new ID and that newly assigned ID would be returned.

But I can't think of a way to make SQL Server use such a plan in a single query.

Update:

If you have lots of records, few countries and, most importantly, a non-clustered index on CountryName, you could emulate loose scan to build a list of countries:

DECLARE  @country TABLE
         (
         id INT NOT NULL IDENTITY PRIMARY KEY,
         countryName VARCHAR(MAX)
         )
;

WITH    country AS
        (
        SELECT  TOP 1
                countryName
        FROM    persons
        ORDER BY
                countryName
        UNION ALL
        SELECT  (
                SELECT  countryName
                FROM    (
                        SELECT  countryName,
                                ROW_NUMBER() OVER (ORDER BY countryName) rn
                        FROM    persons
                        WHERE   countryName > country.countryName
                        ) q
                WHERE   rn = 1
                )
        FROM    country
        WHERE   countryName IS NOT NULL
        )
INSERT
INTO    @country (countryName)
SELECT  countryName
FROM    country
WHERE   countryName IS NOT NULL
OPTION  (MAXRECURSION 0)

SELECT  p.firstName, c.id
FROM    persons p
JOIN    @country c
ON      c.countryName = p.countryName
like image 29
Quassnoi Avatar answered Dec 14 '25 09:12

Quassnoi



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!