Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Remove rows with same keys

Tags:

rows

ssis

I have data with rows like the following:

Name1  Name2  Name3  Col  
aaa    bbb    ccc    ...  
abc    ddd    ddd    1  
abc    ddd    ddd    2  
abc    ddd    ddd    3  
fff    fff    fff    ...  
ggg    ggg    hhh    4  
ggg    ggg    hhh    5  

(Name1, Name2 and Name3 are primary keys)

How do I remove the first rows from the set of data with the same 3 primary keys? (leaving only the last row of the set)

I.e. the result from the above would be:

Name1  Name2  Name3  Col  
aaa    bbb    ccc    ...  
abc    ddd    ddd    3  
fff    fff    fff    ...  
ggg    ggg    hhh    5  
like image 782
Albert Y Avatar asked Mar 04 '13 00:03

Albert Y


3 Answers

Assuming your source data is in the correct order, and you want the last record in each set, there aren't any out-of-the-box transformations that would handle this situation. However, a Script Transformation can deal with it fairly easily.

Here's a sample Data Flow:

enter image description here

I'm using FF_SRC_AllRows and FF_DST_SelectedRows as flat file source and destination (respectively) for simplicity's sake, using the sample data you provided; your specific needs will vary. The script transformation SCR_SelectLastRow is configured as a transformation (both input and output):

enter image description here

Select all your input columns (with usage type ReadOnly):

enter image description here

Create one output (I named mine OutgoingRows, but you can name it whatever you wish), and set the SynchronousInputID property to None. This will let your script filter out the rows you don't want.

enter image description here

Add output columns corresponding to the input columns:

enter image description here

And use code along these lines:

/* Microsoft SQL Server Integration Services Script Component
*  Write scripts using Microsoft Visual C# 2008.
*  ScriptMain is the entry point class of the script.*/

using System;
using Microsoft.SqlServer.Dts.Pipeline.Wrapper;
using Microsoft.SqlServer.Dts.Runtime.Wrapper;

[Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute]
public class ScriptMain : UserComponent
{
    class IncomingRowData
    {
        public string Name1;
        public string Name2;
        public string Name3;
        public string Col;
        public IncomingRowData(IncomingRowsBuffer Row)
        {
            Name1 = Row.Name1;
            Name2 = Row.Name2;
            Name3 = Row.Name3;
            Col = Row.Col;
        }
        public bool KeysDiffer(IncomingRowData other)
        {
            return (Name1 != other.Name1
                || Name2 != other.Name2
                || Name3 != other.Name3);
        }
        public void WriteToOutputBuffer(OutgoingRowsBuffer Row)
        {
            Row.AddRow();
            Row.Name1 = Name1;
            Row.Name2 = Name2;
            Row.Name3 = Name3;
            Row.Col = Col;
        }
    }

    private IncomingRowData _previousRow;

    public override void IncomingRows_ProcessInputRow(IncomingRowsBuffer Row)
    {
        if (_previousRow == null)
        {
            _previousRow = new IncomingRowData(Row);
        }
        IncomingRowData currentRow = new IncomingRowData(Row);
        if (currentRow.KeysDiffer(_previousRow))
        {
            _previousRow.WriteToOutputBuffer(this.OutgoingRowsBuffer);
        }
        _previousRow = currentRow;
    }

    public override void FinishOutputs()
    {
        if (_previousRow != null)
        {
            _previousRow.WriteToOutputBuffer(this.OutgoingRowsBuffer);
        }
        base.FinishOutputs();
    }
}

One nice thing about this technique is that it allows you to handle the data in a single pass and requires neither the use of staging tables nor keeping the entire source data set in memory. Depending on how big your data set is, either of those could cause significant performance issues.

like image 157
Edmund Schweppe Avatar answered Oct 17 '22 14:10

Edmund Schweppe


Recommendation #1: Do this in your source query if at all possible.

Assuming that's not possible, and assuming that you always want the maximum value for Col to be selected, you can use the Aggregate component in your data flow.

Just add all of your columns to the aggregate input, and for the operations choose "Group By" for Name1, Name2, and Name3, and "Maximum" for Col.

Unfortunately, the aggregate component is an asynchronous component - meaning your entire flow will pause while the data flows into it, because it won't know the "maximum" value of each set until it's read in every row.

like image 28
Kyle Hale Avatar answered Oct 17 '22 16:10

Kyle Hale


SELECT name1, 
       name2, 
       name3, 
       col 
FROM   (SELECT name1, 
               name2, 
               name3, 
               col, 
               Max(rn) 
                 over ( 
                   PARTITION BY name1, name2, name3 ) AS max_rn, 
               rn 
        FROM   (SELECT name1, 
                       name2, 
                       name3, 
                       col, 
                       Row_number() 
                         over ( 
                           PARTITION BY name1, name2, name3 
                           ORDER BY col ) AS rn 
                FROM   test1)) 
WHERE  max_rn = rn; 

You can try this where test1 is tablename

like image 1
fortm Avatar answered Oct 17 '22 14:10

fortm