Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to optimize SQL Server Columnstore Alignment

I have a Clustered Columnstore Index Table for our IOT metrics (timeseries data). It contains more than 1 billion rows and structured like this:

CREATE TABLE [dbo].[Data](
[DeviceId] [bigint] NOT NULL,
[MetricId] [smallint] NOT NULL,
[TimeStamp] [datetime2](2) NOT NULL,
[Value] [real] NOT NULL
)

CREATE CLUSTERED INDEX [PK_Data] ON [dbo].[Data] ([TimeStamp],[DeviceId],[MetricId]) --WITH (DROP_EXISTING = ON)
CREATE CLUSTERED COLUMNSTORE INDEX [PK_Data] ON [dbo].[Data] WITH (DROP_EXISTING = ON, MAXDOP = 1, DATA_COMPRESSION = COLUMNSTORE_ARCHIVE)

There are some 10,000 distinct DeviceId values and TimeStamps range from 2008 till now. A typical query against this table looks like this:

SET STATISTICS TIME, IO ON
SELECT
    [DeviceId]
    ,[MetricId]
    ,DATEADD(hh, DATEDIFF(day, '2005-01-01', [TimeStamp]), '2005-01-01') As [Date]
    ,MIN([Value]) as [Min]
    ,MAX([Value]) as [Max]
    ,AVG([Value]) as [Avg]
    ,SUM([Value]) as [Sum]
    ,COUNT([Value]) as [Count]
FROM
    [dbo].[Data]
WHERE
    [DeviceId] = 6077129891325167032
    AND [MetricId] = 1000
    AND [TimeStamp] BETWEEN '2017-07-01' AND '2017-07-30'
GROUP BY
    [DeviceId]
    ,[MetricId]
    ,DATEDIFF(day, '2005-01-01', [TimeStamp])
ORDER BY
    [DeviceId]
    ,[MetricId]
    ,DATEDIFF(day, '2005-01-01', [TimeStamp])

When I execute this query, I get this for performance metrics:

Because at the moment a query like stated above does too many Segment reads I believe:

Table 'Data'. Scan count 2, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 5257, lob physical reads 9, lob read-ahead reads 4000.
Table 'Data'. Segment reads 11, segment skipped 764.

Query plan: Query plan

This is not well optimized I believe as there were 11 segments read to retrieve only 212 out of 1 billion source rows (before grouping/aggregation)

So then I ran Niko Neugebauer's great scripts to validate our setup and the Columnstore Alignment https://github.com/NikoNeugebauer/CISL/blob/master/Azure/alignment.sql, I get this result after rebuilding the Columnstore Clustered Index:

Columnstore Alignment

MetricId and TimeStamp columns have optimal alignment score of 100%. How can we ensure that the DeviceId column is also well aligned? I played with the column order in the initial Clustered (Rowstore) index, is that where things can be optimized?

like image 420
Ted van der Veen Avatar asked Aug 05 '17 07:08

Ted van der Veen


People also ask

Do Columnstore indexes help query performance?

Columnstore indexes achieve up to 10x greater data compression than rowstore indexes. This greatly reduces the I/O required to execute analytics queries and therefore improves query performance. Columnstore indexes read compressed data from disk, which means fewer bytes of data need to be read into memory.

What is the difference between Columnstore index and clustered index?

A nonclustered columnstore index and a clustered columnstore index function the same. The difference is that a nonclustered index is a secondary index that's created on a rowstore table, but a clustered columnstore index is the primary storage for the entire table.

When should I use Columnstore?

Consider using a clustered columnstore index when: Each partition has at least a million rows. Columnstore indexes have rowgroups within each partition. If the table is too small to fill a rowgroup within each partition, you won't get the benefits of columnstore compression and query performance.

What is Columnstore compression?

Columnstore tables and indexes are always stored with columnstore compression. You can further reduce the size of columnstore data by configuring an additional compression called archival compression. To perform archival compression, SQL Server runs the Microsoft XPRESS compression algorithm on the data.


2 Answers

the key solution to align your table by DeviceId is to build a clustered rowstore index on your table and then build over it a clustered Columnstore Index with MAXDOP = 1 (in order not to introduce any overlaps which take place when index build runs with multiple cores). So the possible code will look something like this:

CREATE CLUSTERED INDEX [PK_Data] ON [dbo].[Data] ([DeviceId],[TimeStamp],[MetricId]) --WITH (DROP_EXISTING = ON)
CREATE CLUSTERED COLUMNSTORE INDEX [PK_Data] ON [dbo].[Data] WITH (DROP_EXISTING = ON, MAXDOP = 1, DATA_COMPRESSION = COLUMNSTORE_ARCHIVE)

Another possibility would be to do it all within CISL, by preparing and then executing the alignment funcitonality:

insert into dbo.cstore_Clustering( TableName, Partition, ColumnName )
    VALUES ('[dbo].[Data]', 1, 'DeviceId' );

This is though just for 1 partition, but you should consider partitioning your table anyway, once you get into the numbers you are using. After setting up, you can start executing dbo.cstore_doAlignment, which will automatically re-align and optimise your table. (You will have some parameters to configure the threshold of the optimisations, if you like)

Best regards, Niko

like image 94
Niko Neugebuer Avatar answered Sep 20 '22 10:09

Niko Neugebuer


When Max dop set as 1 for creating clustered column store will sort your records effectively, but for a table with 1 billion rows this max dop 1 will not help u. It is better to partition the table with any of the date columns and then create a clustered column store index with max dop any thing higher than 1 or 0. But in that case sorting will not be guaranteed but clustered column store index will do the segment elimination effectively. One should clearly noted, do not drop or create clustered column store index by keeping any other non clustered index in your table, this will impact your clustered column store index creation/drop performance. If you drop a clustered column store index by keeping other index SQL server will do lot of works on the other indexes.

like image 32
Kannan.C Avatar answered Sep 19 '22 10:09

Kannan.C