
How to remove duplicate rows from flat file using SSIS?

Let me first say that being able to take 17 million records from a flat file, push them to a database on a remote box, and have it take 7 minutes is amazing. SSIS truly is fantastic. But now that I have that data up there, how do I remove duplicates?

Better yet, I want to take the flat file, remove the duplicates from it, and write the results back out to another flat file.

I am thinking about a:

Data Flow Task

  • File source (with an associated file connection)
  • A for loop container
  • A script container that contains some logic to tell if another row exists

Thank you; everyone on this site is incredibly knowledgeable.

Update: I have found this link; it might help in answering this question.

asked Sep 29 '08 by RyanKeeter



2 Answers

Use the Sort Component.

Simply choose which fields you wish to sort your loaded rows by, and in the bottom left corner you'll see a check box to remove duplicates. This option removes any rows that are duplicates based on the sort criteria only. So in the example below, the two rows would be considered duplicates if we sorted on the first field alone:

1 | sample A |
1 | sample B |
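Outside of SSIS, the Sort component's behavior can be illustrated in a few lines. This is a minimal sketch, not SSIS itself: it keeps the first row seen for each sort-key value and drops the rest, assuming rows are tuples and the key is the first field.

```python
def dedupe_on_key(rows, key_index=0):
    """Keep only the first row seen for each value of the sort key,
    mimicking the Sort component's remove-duplicates option."""
    seen = set()
    result = []
    for row in rows:
        key = row[key_index]
        if key not in seen:
            seen.add(key)
            result.append(row)
    return result

rows = [
    ("1", "sample A"),
    ("1", "sample B"),  # duplicate key "1" -> dropped
    ("2", "sample C"),
]
print(dedupe_on_key(rows))  # → [('1', 'sample A'), ('2', 'sample C')]
```

Note that, just as with the Sort component, which of the duplicate rows survives depends only on order of arrival, since the non-key fields are never compared.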
answered Oct 09 '22 by Craig Warren


I would suggest using SSIS to copy the records to a temporary table, then create a task that uses SELECT DISTINCT or RANK, depending on your situation, to select the non-duplicate rows. That task would funnel the results to a flat file and delete them from the temporary table. The last step would be to copy the records from the temporary table into the destination table.
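The SQL side of that approach can be sketched briefly. This uses sqlite3 as a stand-in for SQL Server, and the table and column names are assumptions for illustration; in a real package the staging load and the final copy would be SSIS Data Flow tasks.

```python
# Sketch of the temp-table approach: stage all rows, then copy only
# distinct rows into the destination. sqlite3 stands in for SQL Server.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging (id INTEGER, val TEXT)")
conn.executemany(
    "INSERT INTO staging VALUES (?, ?)",
    [(1, "sample A"), (1, "sample A"), (2, "sample B")],  # one exact dup
)

# SELECT DISTINCT collapses fully identical rows in one set-based pass.
conn.execute("CREATE TABLE destination AS SELECT DISTINCT id, val FROM staging")

print(conn.execute("SELECT * FROM destination ORDER BY id").fetchall())
# → [(1, 'sample A'), (2, 'sample B')]
```

SELECT DISTINCT handles exact-duplicate rows; RANK (or ROW_NUMBER) over a partition is the tool when only some columns define a duplicate and you need to pick which copy to keep.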

Determining duplicates is something SQL is good at, but a flat file is not as well suited for it. In the approach you proposed, the script container would load a row, compare it against all 17 million records, load the next row, and repeat. The performance might not be all that great.
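For completeness, a script-based pass does not have to rescan the file for every row: keeping rows already seen in a hash set turns each check into a single lookup. A sketch, assuming the whole line serves as the duplicate key (memory for 17 million keys is the trade-off):

```python
def dedupe_stream(lines):
    """Single-pass dedup: remember each row in a set, so every new row
    costs one hash lookup instead of a scan over all previous rows."""
    seen = set()
    for line in lines:
        if line not in seen:
            seen.add(line)
            yield line

data = ["1|sample A", "1|sample A", "2|sample B"]
print(list(dedupe_stream(data)))  # → ['1|sample A', '2|sample B']
```

Even so, the set-based SQL approach above keeps the work inside the database engine, which is usually the better fit at this scale.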

answered Oct 09 '22 by Timothy Lee Russell