Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Most performant way in SQL Server to condense multiple data changes into before and after values

Tags:

sql

sql-server

I have a SQL Server database with some audit records showing changes to a third party database (OpenEdge). I have no control over the structure of the audit data, nor the way the third party database audits data changes. So I'm left with, for example, the following data...

source dataset

If you follow the first five rows you can see they all belong to TransId 1532102 (represents a database transaction) where the TransSeq represents a database action within a single transaction.

In the columns prefix New the audit changes are visible. If the value is NULL then no change to that field took place.

Looking at the data you can see that where TransId = 1532102 the PrimaryIdentifier is changed from 2 to -2 (row 1), then from -2 to 3 (row 3), then from 3 to 4 (row 4) and finally from 4 to 5 (row 5). You might also notice that when the PrimaryIdentifier changes from 3 to 4 the SecondaryIdentifier changes from 'abcd' to 'efgh' (row 4). So these multiple changes are actually only occurring on a single source record. So with this in mind rows 1, 3, 4 & 5 can all be condensed into a single row (see below)

enter image description here

Ultimately there are only two record changes in TransId 1532102..

enter image description here

I need to translate these changes into a single UPDATE statement on a target database. In order to do this I need to ensure I have a single record showing the before and after values.

So given the source data presented here I need to produce the following data set..

required dataset

What query structures could I use to achieve this? I was thinking recursive CTEs or perhaps using Hierarchical structures? Ultimately I need this to perform as well as possible so I wanted to pose the question here in case I hadn't considered all possible approaches.

Thoughts welcome and here's a script for the sample data

DECLARE @TestTable TABLE (SyncId INT, TransId INT, TransSeq INT, PrimaryIdentifier INT, SecondaryIdentifier NCHAR(4), NewPrimaryIdentifier INT, NewSecondaryIdentifier NCHAR(4), NewLevel INT, NewValue NVARCHAR(20))
INSERT  @TestTable
        SELECT 128, 1532102, 0,  2, 'abcd',   -2,   NULL, NULL, 'test data'
UNION   SELECT 128, 1532102, 1,  3, 'abcd',    2,   NULL, NULL, NULL
UNION   SELECT 128, 1532102, 2, -2, 'abcd',    3,   NULL, NULL, NULL
UNION   SELECT 128, 1532102, 3,  3, 'abcd',    4, 'efgh', NULL, NULL
UNION   SELECT 128, 1532102, 4,  4, 'efgh',    5,   NULL,    2, NULL
UNION   SELECT 128, 1532102, 5,  5, 'efgh', NULL, 'ghfi', NULL, NULL
UNION   SELECT 128, 1532106, 0,  3, 'abcd',   -3,   NULL, NULL, NULL
UNION   SELECT 128, 1532106, 1,  4, 'abcd',    3,   NULL, NULL, NULL
UNION   SELECT 128, 1532106, 2, -3, 'abcd',    4,   NULL, NULL, NULL
UNION   SELECT 128, 1532110, 0,  4, 'abcd',   -4,   NULL, NULL, NULL
UNION   SELECT 128, 1532110, 1,  5, 'abcd',    4,   NULL, NULL, NULL
UNION   SELECT 128, 1532110, 2, -4, 'abcd',    5,   NULL, NULL, NULL
UNION   SELECT 128, 1532114, 0,  5, 'abcd',   -5,   NULL, NULL, NULL
UNION   SELECT 128, 1532114, 1,  4, 'abcd',    5,   NULL,    1, NULL
UNION   SELECT 128, 1532114, 2, -5, 'abcd',    4,   NULL, NULL, 'some more test data'

SELECT  *
FROM    @TestTable

EDIT: I've actually been unable to write any queries that successfully track the identifier changes. Can anyone help - I need a query that tracks the changes in PrimaryIdentifier values and ultimately provides a single record for each tracking with start values and end values.

EDIT 2: There's been a deleted answer that suggests the update to the key identifiers is not possible when condensed and that I should step through the changes instead. I thought it would be valuable to add my comments for further info to the question..

I need to condense the dataset because of the volume of audit records being generated; most of which are unecessary because of the way the source DBMS makes its changes. I need to reduce the dataset and I need to track key identifier changes. The update should be possible without clashing on id change during the update statement - see this example.

like image 495
Drammy Avatar asked Nov 13 '16 13:11

Drammy


People also ask

What makes data retrieval faster in SQL?

You can retrieve data faster and optimize SQL queries by using clustered and non-clustered SQL Server indexes. Indexes can reduce runtime, but it's also important to consider how much disk space they require. In Microsoft SQL Server, you won't need additional disk space for your clustered indexes.

How can we improve the performance of update query in SQL Server?

Best practices to improve SQL update statement performance We need to consider the lock escalation mode of the modified table to minimize the usage of too many resources. Analyzing the execution plan may help to resolve performance bottlenecks of the update query. We can remove the redundant indexes on the table.

Which data type is faster in SQL?

To speed up frequent sorts, use an int (or an integer-based) data type if possible. SQL Server sorts integer data faster than character data.


2 Answers

I assume that
1) (PrimaryIdentifier, SecondaryIdentifier) is a PK of the target table,
2) Every transacton in the audit table leaves target table in a consistent state. So the update of the PK in a single statement for every transaction using case will run OK:

declare @t table (id int primary key, old int);
insert @t(id, old) values (4,4),(5,5);
update @t set id = case id 
     when 4 then 5 
     when 5 then 4 end;
select * from @t;

The plan is 1. Condense transactions 2. Generate update sql into temp table. Then you can run all or selected items from the temp table. Every item is of the form

UPDATE myTable SET 
         PrimaryIdentifier = CASE WHEN PrimaryIdentifier=2 AND SecondaryIdentifier='abcd' THEN 5 
                                  WHEN PrimaryIdentifier=3 AND SecondaryIdentifier='abcd' THEN 2 END,  
        SecondaryIdentifier = CASE WHEN PrimaryIdentifier=2 AND SecondaryIdentifier='abcd' THEN 'efgh' 
                                   WHEN PrimaryIdentifier=3 AND SecondaryIdentifier='abcd' THEN 'abcd' END , 
        Level= CASE WHEN PrimaryIdentifier=2 AND SecondaryIdentifier='abcd' THEN 2 
                    WHEN PrimaryIdentifier=3 AND SecondaryIdentifier='abcd' THEN  Level  END , 
        Value= CASE WHEN PrimaryIdentifier=2 AND SecondaryIdentifier='abcd' THEN 'test data' 
                    WHEN PrimaryIdentifier=3 AND SecondaryIdentifier='abcd' THEN  Value  END
WHERE 1=2 OR (PrimaryIdentifier=2 AND SecondaryIdentifier='abcd') 
          OR (PrimaryIdentifier=3 AND SecondaryIdentifier='abcd')

The query

DECLARE @TestTable TABLE (SyncId INT, TransId INT, TransSeq INT, PrimaryIdentifier INT, SecondaryIdentifier NCHAR(4), NewPrimaryIdentifier INT, NewSecondaryIdentifier NCHAR(4), NewLevel INT, NewValue NVARCHAR(20))
INSERT  @TestTable
        SELECT 128, 1532102, 0,  2, 'abcd', -2, NULL,   NULL,   'test data'
UNION   SELECT 128, 1532102, 1,  3, 'abcd',  2, NULL,   NULL,   NULL
UNION   SELECT 128, 1532102, 2, -2, 'abcd',  3, NULL,   NULL,   NULL
UNION   SELECT 128, 1532102, 3,  3, 'abcd',  4, 'efgh', NULL,   NULL
UNION   SELECT 128, 1532102, 4,  4, 'efgh',  5, NULL,   2,      NULL
UNION   SELECT 128, 1532106, 0,  3, 'abcd', -3, NULL,   NULL,   NULL
UNION   SELECT 128, 1532106, 1,  4, 'abcd',  3, NULL,   NULL,   NULL
UNION   SELECT 128, 1532106, 2, -3, 'abcd',  4, NULL,   NULL,   NULL
UNION   SELECT 128, 1532110, 0,  4, 'abcd', -4, NULL,   NULL,   NULL
UNION   SELECT 128, 1532110, 1,  5, 'abcd',  4, NULL,   NULL,   NULL
UNION   SELECT 128, 1532110, 2, -4, 'abcd',  5, NULL,   NULL,   NULL
UNION   SELECT 128, 1532110, 3,  5, 'abcd',  6, NULL,   NULL,   NULL
UNION   SELECT 128, 1532110, 4,  6, 'abcd',  5, NULL,   NULL,   NULL
UNION   SELECT 128, 1532114, 0,  5, 'abcd', -5, NULL,   NULL,   NULL
UNION   SELECT 128, 1532114, 1,  4, 'abcd',  5, NULL,   1,      NULL
UNION   SELECT 128, 1532114, 2, -5, 'abcd',  4, NULL,   NULL,   'some more test data'
;
WITH root AS (
    -- Top parent updates within transactions
    SELECT SyncId, TransId, TransSeq, PrimaryIdentifier AS rPrimaryIdentifier, SecondaryIdentifier AS rSecondaryIdentifier, 
    NewPrimaryIdentifier, 
    coalesce(NewSecondaryIdentifier, SecondaryIdentifier) AS NewSecondaryIdentifier,
    newLevel, NewValue
    FROM  @TestTable t
    WHERE NOT EXISTS (SELECT 1 
                   FROM  @TestTable t2 
                   WHERE t2.SyncId=t.SyncId AND t2.TransId = t.TransId
                       AND t2.TransSeq < t.TransSeq 
                       AND t.PrimaryIdentifier = t2.NewPrimaryIdentifier
                       AND t.SecondaryIdentifier = coalesce(t2.NewSecondaryIdentifier, t2.SecondaryIdentifier) 
                   )
    -- recursion to track the chain of updates
    UNION ALL
    SELECT root.SyncId, root.TransId, t.TransSeq, rPrimaryIdentifier, rSecondaryIdentifier,
         t.NewPrimaryIdentifier,
         coalesce(t.NewSecondaryIdentifier, root.NewSecondaryIdentifier),
         coalesce(root.NewLevel, t.NewLevel), coalesce(root.NewValue, t.NewValue)
    FROM root 
    JOIN @TestTable t ON root.SyncId=t.SyncId AND root.TransId = t.TransId
                       AND root.TransSeq < t.TransSeq 
                       AND t.PrimaryIdentifier = root.NewPrimaryIdentifier
                       AND t.SecondaryIdentifier = root.NewSecondaryIdentifier

)
,condensed as (
    -- last update in the chain
    SELECT TOP(1) WITH TIES *  
    FROM root
    ORDER BY row_number() over (partition by SyncId, TransId, rPrimaryIdentifier, rSecondaryIdentifier 
                                order by TransSeq desc)
)
-- generate sql
SELECT SyncId, TransId, sql = 'UPDATE myTable SET PrimaryIdentifier = CASE'

    + (SELECT ' WHEN PrimaryIdentifier='+ CAST(rPrimaryIdentifier as varchar(20)) 
             +' AND SecondaryIdentifier=''' + rSecondaryIdentifier 
             +''' THEN ' + CAST(NewPrimaryIdentifier as varchar(20))             
        FROM condensed c2 
        WHERE c1.SyncId = c2.SyncId AND  c1.TransId= c2.TransId
        FOR XML PATH('') ) 
    + ' END,  SecondaryIdentifier = CASE'
    + (SELECT ' WHEN PrimaryIdentifier='+ CAST(rPrimaryIdentifier as varchar(20)) 
             +' AND SecondaryIdentifier=''' + rSecondaryIdentifier
             +''' THEN ''' + NewSecondaryIdentifier + ''''
        FROM condensed c2 
        WHERE c1.SyncId = c2.SyncId AND  c1.TransId= c2.TransId
        FOR XML PATH('') )
    + ' END , Level= CASE'
    + (SELECT ' WHEN PrimaryIdentifier='+ CAST(rPrimaryIdentifier as varchar(20)) 
             +' AND SecondaryIdentifier=''' + rSecondaryIdentifier
             +''' THEN ' 
             + CASE WHEN NewLevel IS NULL THEN ' Level ' ELSE CAST(NewLevel  as varchar(20)) END 
        FROM condensed c2 
        WHERE c1.SyncId = c2.SyncId AND  c1.TransId= c2.TransId
        FOR XML PATH('') )
    + ' END , Value= CASE'
    + (SELECT ' WHEN PrimaryIdentifier='+ CAST(rPrimaryIdentifier as varchar(20)) 
             +' AND SecondaryIdentifier=''' + rSecondaryIdentifier
             +''' THEN ' 
             + CASE WHEN NewValue IS NULL THEN ' Value ' ELSE '''' + NewValue + '''' END 
        FROM condensed c2 
        WHERE c1.SyncId = c2.SyncId AND  c1.TransId= c2.TransId
        FOR XML PATH('') )
     + ' END'
     + ' WHERE 1=2'
     + (SELECT ' OR (PrimaryIdentifier='+ CAST(rPrimaryIdentifier as varchar(20)) 
         +' AND SecondaryIdentifier=''' + rSecondaryIdentifier +''')'
    FROM condensed c2 
    WHERE c1.SyncId = c2.SyncId AND  c1.TransId= c2.TransId
    FOR XML PATH('') )
INTO #UpdSql    
FROM condensed c1 
GROUP BY SyncId, TransId


SELECT * 
FROM #UpdSql
ORDER BY SyncId, TransId

EDIT

Taking into account NewPrimaryIdentifier can be NULL too. See added row at @TestTable. Sql generation skipped.

DECLARE @TestTable TABLE (SyncId INT, TransId INT, TransSeq INT, PrimaryIdentifier INT, SecondaryIdentifier NCHAR(4), NewPrimaryIdentifier INT, NewSecondaryIdentifier NCHAR(4), NewLevel INT, NewValue NVARCHAR(20))
INSERT  @TestTable
        SELECT 128, 1532102, 0,  2, 'abcd', -2, NULL,   NULL,   'test data'
UNION   SELECT 128, 1532102, 1,  3, 'abcd',  2, NULL,   NULL,   NULL
UNION   SELECT 128, 1532102, 2, -2, 'abcd',  3, NULL,   NULL,   NULL
UNION   SELECT 128, 1532102, 3,  3, 'abcd',  4, 'efgh', NULL,   NULL
UNION   SELECT 128, 1532102, 4,  4, 'efgh',  5, NULL,   2,      NULL
UNION   SELECT 128, 1532102, 5,  5, 'efgh', null, 'ghfi', null, NULL -- added
UNION   SELECT 128, 1532106, 0,  3, 'abcd', -3, NULL,   NULL,   NULL
UNION   SELECT 128, 1532106, 1,  4, 'abcd',  3, NULL,   NULL,   NULL
UNION   SELECT 128, 1532106, 2, -3, 'abcd',  4, NULL,   NULL,   NULL
UNION   SELECT 128, 1532110, 0,  4, 'abcd', -4, NULL,   NULL,   NULL
UNION   SELECT 128, 1532110, 1,  5, 'abcd',  4, NULL,   NULL,   NULL
UNION   SELECT 128, 1532110, 2, -4, 'abcd',  5, NULL,   NULL,   NULL
UNION   SELECT 128, 1532110, 3,  5, 'abcd',  6, NULL,   NULL,   NULL
UNION   SELECT 128, 1532110, 4,  6, 'abcd',  5, NULL,   NULL,   NULL
UNION   SELECT 128, 1532114, 0,  5, 'abcd', -5, NULL,   NULL,   NULL
UNION   SELECT 128, 1532114, 1,  4, 'abcd',  5, NULL,   1,      NULL
UNION   SELECT 128, 1532114, 2, -5, 'abcd',  4, NULL,   NULL,   'some more test data'
;
WITH root AS (
    -- Top parent updates within transactions
    SELECT SyncId, TransId, TransSeq, PrimaryIdentifier AS rPrimaryIdentifier, SecondaryIdentifier AS rSecondaryIdentifier, 
    coalesce(NewPrimaryIdentifier, PrimaryIdentifier) AS NewPrimaryIdentifier,
    coalesce(NewSecondaryIdentifier, SecondaryIdentifier) AS NewSecondaryIdentifier,
    newLevel, NewValue
    FROM  @TestTable t
    WHERE NOT EXISTS (SELECT 1 
                   FROM  @TestTable t2 
                   WHERE t2.SyncId=t.SyncId AND t2.TransId = t.TransId
                       AND t2.TransSeq < t.TransSeq 
                       AND t.PrimaryIdentifier = coalesce(t2.NewPrimaryIdentifier, t2.PrimaryIdentifier)
                       AND t.SecondaryIdentifier = coalesce(t2.NewSecondaryIdentifier, t2.SecondaryIdentifier) 
                   )
    -- recursion to track the chain of updates
    UNION ALL
    SELECT root.SyncId, root.TransId, t.TransSeq, rPrimaryIdentifier, rSecondaryIdentifier,
         coalesce(t.NewPrimaryIdentifier, root.NewPrimaryIdentifier),
         coalesce(t.NewSecondaryIdentifier, root.NewSecondaryIdentifier),
         coalesce(t.NewLevel, root.NewLevel), coalesce(t.NewValue, root.NewValue)
    FROM root 
    JOIN @TestTable t ON root.SyncId=t.SyncId AND root.TransId = t.TransId
                       AND root.TransSeq < t.TransSeq 
                       AND t.PrimaryIdentifier = root.NewPrimaryIdentifier
                       AND t.SecondaryIdentifier = root.NewSecondaryIdentifier

)
,condensed as (
    -- last update in the chain
    SELECT TOP(1) WITH TIES *  
    FROM root
    ORDER BY row_number() over (partition by SyncId, TransId, rPrimaryIdentifier, rSecondaryIdentifier 
                                order by TransSeq desc)
)
SELECT * 
FROM condensed 
ORDER BY SyncId, TransId, rPrimaryIdentifier, rSecondaryIdentifier
like image 82
Serg Avatar answered Oct 20 '22 05:10

Serg


Here is a second stab at producing the originally asked for output. This time using a bunch of CTE:s.

DECLARE @TestTable TABLE (SyncId INT, TransId INT, TransSeq INT, PrimaryIdentifier INT, SecondaryIdentifier NCHAR(4), NewPrimaryIdentifier INT, NewSecondaryIdentifier NCHAR(4), NewLevel INT, NewValue NVARCHAR(20))

INSERT  @TestTable
        SELECT 128, 1532102, 0,  2, 'abcd', -2, NULL,   NULL,   'test data'
UNION   SELECT 128, 1532102, 1,  3, 'abcd',  2, NULL,   NULL,   NULL
UNION   SELECT 128, 1532102, 2, -2, 'abcd',  3, NULL,   NULL,   NULL
UNION   SELECT 128, 1532102, 3,  3, 'abcd',  4, 'efgh', NULL,   NULL
UNION   SELECT 128, 1532102, 4,  4, 'efgh',  5, NULL,   2,      NULL
UNION   SELECT 128, 1532106, 0,  3, 'abcd', -3, NULL,   NULL,   NULL
UNION   SELECT 128, 1532106, 1,  4, 'abcd',  3, NULL,   NULL,   NULL
UNION   SELECT 128, 1532106, 2, -3, 'abcd',  4, NULL,   NULL,   NULL
UNION   SELECT 128, 1532110, 0,  4, 'abcd', -4, NULL,   NULL,   NULL
UNION   SELECT 128, 1532110, 1,  5, 'abcd',  4, NULL,   NULL,   NULL
UNION   SELECT 128, 1532110, 2, -4, 'abcd',  5, NULL,   NULL,   NULL
UNION   SELECT 128, 1532110, 3,  5, 'abcd',  6, NULL,   NULL,   NULL
UNION   SELECT 128, 1532110, 4,  6, 'abcd',  5, NULL,   NULL,   NULL
UNION   SELECT 128, 1532114, 0,  5, 'abcd', -5, NULL,   NULL,   NULL
UNION   SELECT 128, 1532114, 1,  4, 'abcd',  5, NULL,   1,      NULL
UNION   SELECT 128, 1532114, 2, -5, 'abcd',  4, NULL,   NULL,   'some more test data'


;with baseCTE as (
    select SyncId, TransId, TransSeq, PrimaryIdentifier, SecondaryIdentifier,
            isnull(NewPrimaryIdentifier, PrimaryIdentifier) as NewPrimaryIdentifier,
            isnull(NewSecondaryIdentifier, SecondaryIdentifier) as NewSecondaryIdentifier,
            NewLevel, NewValue
    from @TestTable
),
syncTransEntryPointsCte as (
    select *
    from baseCTE b
    where not exists(
        select *
        from baseCTE subb
        where b.SyncId = subb.SyncId
            and b.TransId = subb.TransId
            and b.PrimaryIdentifier = subb.NewPrimaryIdentifier
            and b.SecondaryIdentifier = subb.NewSecondaryIdentifier
            and b.TransSeq > subb.TransSeq
    )
)
, recursiveBaseCte as (
    select *, 0 as lev, TransSeq as OrigTransSec from syncTransEntryPointsCte

    union all 

    select 
        c.SyncId, c.TransId, c.TransSeq, p.PrimaryIdentifier, p.SecondaryIdentifier, c.NewPrimaryIdentifier, c.NewSecondaryIdentifier, isnull(c.NewLevel, p.NewLevel), isnull(c.NewValue, p.NewValue),
        p.lev + 1,
        p.OrigTransSec
    from baseCTE c
        join recursiveBaseCte as p on (
            c.SyncId = p.SyncId and c.TransId = p.TransId and c.PrimaryIdentifier = p.NewPrimaryIdentifier and c.SecondaryIdentifier = p.NewSecondaryIdentifier and c.TransSeq > p.TransSeq
        )
)
select r.SyncId, r.TransId, r.OrigTransSec as TransSec, 
    r.PrimaryIdentifier, r.SecondaryIdentifier, 
    nullif(r.NewPrimaryIdentifier, r.PrimaryIdentifier) as NewPrimaryIdentifier,
    nullif(r.NewSecondaryIdentifier, r.SecondaryIdentifier) as NewSecondaryIdentifier,
    r.NewLevel, r.NewValue
from recursiveBaseCte r
    join (
        select SyncId, TransId, PrimaryIdentifier, SecondaryIdentifier, max(lev) as mlev 
        from recursiveBaseCte 
        group by SyncId, TransId, PrimaryIdentifier, SecondaryIdentifier
    ) as selectForOutput on 
        r.SyncId = selectForOutput.SyncId
        and r.TransId = selectForOutput.TransId
        and r.PrimaryIdentifier = selectForOutput.PrimaryIdentifier
        and r.SecondaryIdentifier = selectForOutput.SecondaryIdentifier
        and r.lev = selectForOutput.mlev
order by 1,2,3

Whether or not the CTE approach is any faster than the cursor based one is difficult to guess. I do suggest you test run this at a suitable time when the server in question is not under heavy load.

Update

The script first declares the baseCTE which is used just to make sure that we have values in NewPrimaryIdentifier and NewSecondaryIdentifier for each row, even if one or both of them were not changed in the update. This makes everything after that easier since we can then join to the next row for the same combination within a specific transaction.

The syncTransEntryPointCte in turn uses baseCTE to find all rows within one transaction that were not preceded by another row within the same transaction.

recursiveBaseCte then uses both of the previous CTE:s to recursively find rows and aggregate changes. The final query then uses it to produce the final output.

The output should be usable for updating a stale copy of the source table if you can manage to do the updates for one condensed transaction in one update statement. If, as I originally assumed, you try to build one update statement for each row in the condensed audit output, it will not work.

Finally, obligatory disclaimer: This seems to work with the test data you gave in the question. I can give no guarantees that it works for the real thing, so use with caution.

like image 34
user1429080 Avatar answered Oct 20 '22 03:10

user1429080