SQL Server - Speed up count on large table

Tags:

I have a table with close to 30 million records. Just several columns. One of the column 'Born' have not more than 30 different values and there is an index defined on it. I need to be able to filter on that column and efficiently page through results.

For now I have (example if the year I'm searching for is '1970' - it is a parameter in my stored procedure):

WITH PersonSubset as
(
    SELECT *, ROW_NUMBER() OVER (ORDER BY Born asc) AS Row
    FROM Person WITH (INDEX(IX_Person_Born)) 
    WHERE Born = '1970'
)
SELECT *, (SELECT count(*) FROM PersonSubset) AS TotalPeople
FROM PersonSubset
WHERE Row BETWEEN 0 AND 30

Every query of that sort (only Born parameter used) returns just over 1 million results. I've noticed the biggest overhead is on the count used to return the total results. If I remove (SELECT count(*) FROM PersonSubset) AS TotalPeople from the select clause the whole thing speeds up a lot.

Is there a way to speed up the count in that query. What I care about is to have the paged results returned and the total count.

262

asked Nov 29 '12 15:11

mrt

2 Answers

Updated following discussion in comments

The cause of the problem here is very low cardinality of the IX_Person_Born index.

SQL indexes are very good at quickly narrowing down values, but they have problems when you have lots of records with the same value.

You can think of it as like the index of a phone book - if you want to find "Smith, John" you first find that there are lots of names that begin with S, and then pages and pages of people called Smith, and then lots of Johns. You end up scanning the book.

This is compounded because the index in the phone book is clustered - the records are sorted by surname. If instead you want to find everyone called "John" you'll be doing a lot of looking up.

Here there are 30 million records but only 30 different values, which means that the best possible index is still returning around 1 million records - at that sort of scale it might as well be a table-scan. Each of those 1 million results is not the actual record - it's a lookup from the index to the table (the page number in the phone book analogy), which makes it even slower.

A high cardinality index (say for full date of birth), rather than year would be much quicker.

This is a general problem for all OLTP relational databases: low cardinality + huge datasets = slow queries because index-trees don't help much.

In short: there's no significantly quicker way to get the count using T-SQL and indexes.

You have a couple of options:

1. Data Aggregation

Either OLAP/Cube rollups or do it yourself:

select Born, count(*) 
from Person 
group by Born

The pro is that cube lookups or checking your cache is very fast. The problem is that the data will get out of date and you need some way to account for that.

2. Parallel Queries

Split into two queries:

SELECT count(*) 
FROM Person 
WHERE Born = '1970'

SELECT TOP 30 *
FROM Person 
WHERE Born = '1970'

Then run these either in parallel server side, or add it to the user interface.

3. No-SQL

This problem is one of the big advantages no-SQL solutions have over traditional relational databases. In a no-SQL system the Person table is federated (or sharded) across lots of cheap servers. When a user searches every server is checked at the same time.

At this point a technology change is probably out, but it may be worth investigating so I've included it.

I have had similar problems in the past with databases of this kind of size, and (depending on context) I've used both options 1 and 2. If the total here is for paging then I'd probably go with option 2 and AJAX call to get the count.

142

answered Oct 04 '22 04:10

Keith

DECLARE @TotalPeople int
  --does this query run fast enough?  If not, there is no hope for a combo query.
SET @TotalPeople = (SELECT count(*) FROM Person WHERE Born = '1970')


WITH PersonSubset as
(
    SELECT *, ROW_NUMBER() OVER (ORDER BY Born asc) AS Row
    FROM Person WITH (INDEX(IX_Person_Born)) 
    WHERE Born = '1970'
)
SELECT *, @TotalPeople as TotalPeople
FROM PersonSubset
WHERE Row BETWEEN 0 AND 30

You usually can't take a slow query, combine it with a fast query, and wind up with a fast query.

One of the column 'Born' have not more than 30 different values and there is an index defined on it.

Either SQL Server isn't using the index or statistics, or the index and statistics aren't helpful enough.

Here is a desperate measure that will force Sql's hand (at the potential cost of making writes very expensive - measure that, and blocking schema changes to the Person table while the view exists).

CREATE VIEW dbo.BornCounts WITH SCHEMABINDING
AS
SELECT Born, COUNT_BIG(*) as NumRows
FROM dbo.Person
GROUP BY Born

GO 

CREATE UNIQUE CLUSTERED INDEX BornCountsIndex ON BornCounts(Born)

By putting a clustered index on a view, you make it a system maintained copy. The size of this copy is much smaller than 30 Million rows, and it has the exact information you're looking for. I did not have to change the query to get it to use the view, but you're free to use the view's name in the query if you like.

answered Oct 04 '22 06:10

Amy B

Related questions
                            
                                SQL distinct with group by in Oracle
                            
                                MySQL retrieve latest record for Group
                            
                                Execute a custom query in nhibernate and map to a custom domain object
                            
                                How to get current GMT in Oracle
                            
                                Filtering with joined tables
                            
                                MySQL longtext analogue in Microsoft SQL?
                            
                                SqlDependency simple select query always returns "Invalid"
                            
                                How to insert binary data into sqlite3 database in bash?
                            
                                INSERT statement conflicted with COLUMN FOREIGN KEY
                            
                                SQL Server connect error: Instance failure
                            
                                Operation not allowed after ResultSet closed
                            
                                Finding the count of characters and numbers in a string
                            
                                SQL Server 2012 how do I view data?
                            
                                Confusion over the second argument of the DATEDIFF function
                            
                                Update Automation in SQL-Server
                            
                                How to create a pivot query in SQL Server with CASE Statement?
                            
                                how to mysql_fetch_array on joined tables, but columns have same name [duplicate]
                            
                                SQL JOIN table with a date range
                            
                                How to do a moving average on a range of dates in SQL?
                            
                                SQL Server and implicit conversion of types

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

SQL Server - Speed up count on large table

Tags:

performance

sql

sql-server

mrt

People also ask