I am looking for suggestions to optimize this query, which already runs for over an hour with about 300,000 rows in the table. We are using a reporting tool that requires the data to be in this shape when it's pulled, so redesigning the table structure is not an option. The table looks like this:
CREATE TABLE [datatable](
[RowID] [int] IDENTITY(1,1) NOT NULL,
[CampaignID] [int] NOT NULL,
[CampaignName] [nvarchar](255) NULL,
[Category] [nvarchar](255) NOT NULL,
[PostID] [int] NOT NULL,
[TopicName] [nvarchar](4000) NULL,
[TopicFrequency] [int] NULL
)
Data is being constantly added to the table, so I have to periodically update topic frequencies. Here is my current query:
UPDATE datatable
SET TopicFrequency = b.TopicFrequency
FROM datatable INNER JOIN
(SELECT CampaignID, Category, TopicName, COUNT(DISTINCT PostID) AS TopicFrequency
FROM datatable GROUP BY CampaignID, Category, TopicName) AS b
ON datatable.CampaignID = b.CampaignID
AND datatable.Category = b.Category
AND datatable.TopicName = b.TopicName
With TopicName being nvarchar(4000), I can't create an index on the field (it exceeds the 900-byte index key limit). Looking for ideas. Thanks.
The general solution is to split your table into two or more tables, i.e. normalize the data structure. I think two more tables could be introduced: one for Campaigns and one for Topics.
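As a rough sketch of that idea (table and column names here are purely illustrative, not a prescription):

```sql
-- Hypothetical normalized layout: the wide strings move into lookup tables.
CREATE TABLE Campaigns(
    CampaignID   int NOT NULL PRIMARY KEY,
    CampaignName nvarchar(255) NULL
);

CREATE TABLE Topics(
    TopicID   int IDENTITY(1,1) PRIMARY KEY,
    TopicName nvarchar(4000) NULL
);

-- The fact table then joins and groups on narrow integer keys
-- instead of an nvarchar(4000) column.
CREATE TABLE Posts(
    RowID          int IDENTITY(1,1) PRIMARY KEY,
    CampaignID     int NOT NULL REFERENCES Campaigns(CampaignID),
    Category       nvarchar(255) NOT NULL,
    PostID         int NOT NULL,
    TopicID        int NOT NULL REFERENCES Topics(TopicID),
    TopicFrequency int NULL
);
```

The frequency UPDATE would then group by (CampaignID, Category, TopicID), all of which are cheap to index.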
BUT
For your present data structure, you can create a uniqueidentifier or bigint computed column as a hash of the TopicName field, index it, and join on the hash instead of the string field. Here is an example with bigint:
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE FUNCTION [dbo].[HashString64SVF](@input NVARCHAR(4000))
RETURNS BIGINT
WITH SCHEMABINDING, RETURNS NULL ON NULL INPUT
AS
BEGIN
    -- Hash the case-normalized string once, then fold the 20-byte
    -- SHA-1 digest into a single bigint by XORing its pieces.
    DECLARE @hash VARBINARY(20) = HASHBYTES('SHA1', UPPER(@input));
    RETURN
        CAST(SUBSTRING(@hash, 1, 8) AS BIGINT)
      ^ CAST(SUBSTRING(@hash, 9, 8) AS BIGINT)
      ^ CAST(SUBSTRING(@hash, 17, 4) AS BIGINT);
END
GO
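Once the function exists, it can be sanity-checked from any batch; the sample strings below are just illustrative:

```sql
-- Identical text in different casing hashes to the same bigint
-- (the UPPER call normalizes case); different text will, barring
-- a 64-bit collision, produce a different hash.
SELECT dbo.HashString64SVF(N'Database Tuning') AS h1,
       dbo.HashString64SVF(N'DATABASE TUNING') AS h2,  -- same as h1
       dbo.HashString64SVF(N'Something Else')  AS h3;  -- differs
```

Note that two distinct topic names could in principle collide on the same 64-bit hash; with this data size that is extremely unlikely, but it is the tradeoff being made.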
ALTER TABLE datatable ADD TopicNameHash AS dbo.HashString64SVF(TopicName)
GO
CREATE INDEX NewIndexName ON datatable(TopicNameHash, CampaignID, Category) INCLUDE(PostID)
GO
UPDATE datatable
SET TopicFrequency = b.TopicFrequency
FROM datatable
JOIN
(SELECT CampaignID, Category, TopicNameHash, COUNT(DISTINCT PostID) AS TopicFrequency
FROM datatable GROUP BY CampaignID, Category, TopicNameHash) AS b
ON datatable.CampaignID = b.CampaignID
AND datatable.Category = b.Category
AND datatable.TopicNameHash = b.TopicNameHash
ALSO
Create a primary key on the RowID column.
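Since RowID comes from the IDENTITY and is unique, that can be done like this (the constraint name is just illustrative):

```sql
-- Add a clustered primary key on the identity column.
ALTER TABLE datatable
    ADD CONSTRAINT PK_datatable_RowID PRIMARY KEY CLUSTERED (RowID);
```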
ALSO
Recreate the table in a way like this:
CREATE TABLE [datatable](
[RowID] [int] IDENTITY(1,1) PRIMARY KEY,
[CampaignID] [int] NOT NULL,
[Category] [nvarchar](255) NOT NULL,
[PostID] [int] NOT NULL,
--uncomment if needed [TopicNameHash] AS dbo.HashString64SVF(TopicName),
[TopicFrequency] [int] NULL,
[CampaignName] [nvarchar](255) NULL,
[TopicName] [nvarchar](4000) NULL
)
The main reason: if your nullable, variable-length columns are at the end of the column list and contain many NULL values, SQL Server can save a little space in each row, and thus save IO.