
Filter out duplicates from a loaded dataset in SSIS

I'm doing some ETL in SSIS to build some dimensional data sets. One of these is a date. When generating a set of dates for the dimension I can use a lookup against what's already in the date dimension and redirect any that fail, which are assumed to be new dates and then get added to the table.

Problem is the dataset that I've got might itself contain duplicate dates. This will cause errors with unique date keys when inserting into the dimension table. So I'm looking for a way to filter within the dataset that is loaded in the SSIS pipeline.

I could use DISTINCT on the initial loading of date but the date in this case is a DATETIME. I need to use a data conversion transformation later to turn this into a DATE by just taking the date component. I'm looking for unique days and a distinct on a DATETIME won't give me that.
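To make the DATETIME versus DATE point concrete, here is a minimal Python sketch (not SSIS itself). The sample timestamps are hypothetical; it only shows that de-duplicating on the full timestamp keeps every row, while truncating to the date first leaves one row per day.

```python
from datetime import datetime

# Hypothetical sample rows: three timestamps that fall on two calendar days.
loaded = [
    datetime(2011, 11, 23, 9, 15),
    datetime(2011, 11, 23, 14, 11),
    datetime(2011, 11, 24, 8, 0),
]

# A DISTINCT on the full DATETIME keeps all three rows, since the times differ.
distinct_datetimes = set(loaded)

# Truncating to the date component first, then de-duplicating,
# leaves a single entry per calendar day.
distinct_days = {ts.date() for ts in loaded}
```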

I can't use SSIS lookup as I have before as that requires a connection manager that points to a database.

I could tell the OLE DB destination not to use bulk insert and to ignore any errors. This assumes, however, that the only errors will be duplicate dates.

I'm pretty new to SSIS and haven't been able to find a transformation tool that will allow me to compare to other rows in the set.

Daniel Revell asked Nov 23 '11 14:11


1 Answer

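You can either use a Sort transformation and tick its option to remove rows with duplicate sort values, or use an Aggregate transformation with only a Group by operation on the date column (which works more or less like a DISTINCT). Place either one after the data conversion that turns the DATETIME into a DATE, so that duplicates are detected per day. Note that both are asynchronous, blocking transformations: all rows must enter the component before any of them continue downstream, as opposed to synchronous transformations, which just take in and pass on buffers of rows as they arrive.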
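The two options can be sketched in plain Python to show what each does to the rows. This is an illustration of the logic only, not how SSIS implements it; the function names are made up for the example, and the output order of the group-by variant is an artifact of this sketch rather than a guarantee the Aggregate transformation makes.

```python
from datetime import date
from itertools import groupby


def sort_remove_duplicates(days):
    """Sort the rows by date, then keep one row per distinct sort value."""
    # sorted() has to consume every input row before producing anything,
    # which mirrors the blocking behaviour described above.
    return [key for key, _ in groupby(sorted(days))]


def aggregate_group_by(days):
    """Group on the date alone, yielding one row per group (like DISTINCT)."""
    # dict keys are unique, so each date survives exactly once.
    return list(dict.fromkeys(days))


# Hypothetical rows, already converted from DATETIME to DATE.
rows = [date(2011, 11, 24), date(2011, 11, 23), date(2011, 11, 24)]
unique_sorted = sort_remove_duplicates(rows)
unique_grouped = aggregate_group_by(rows)
```

Either result can then go through the lookup against the date dimension, since each day now appears only once in the pipeline.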

cairnz answered Oct 16 '22 04:10