
Best incremental load method using SSIS with over 20 million records

Tags: sql, oracle, ssis

What is needed: I need 25 million records from Oracle incrementally loaded to SQL Server 2012. The package will need to handle UPDATEs, DELETEs, and NEW RECORDS, because the Oracle data source is always changing.

What I have: I've done this many times before, but never with anything past 10 million records. First I have an [Execute SQL Task] that grabs the result set of the [Max Modified Date]. I then have a query that only pulls data from the [ORACLE SOURCE] where the modified date is greater than the [Max Modified Date], and I look that up against my destination table.
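For reference, the watermark step is just something along these lines (the table and column names here are placeholders, not my real ones):

-- Hypothetical query in the [Execute SQL Task] that returns the watermark
SELECT MAX(LAST_MODIFIED_DATE) AS MaxModifiedDate
FROM dbo.DestinationTable;

The [ORACLE Source] query then filters on LAST_MODIFIED_DATE greater than that value, passed in as a parameter.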

I have the [ORACLE Source] connecting to the [Lookup-Destination table]. The lookup is set to NO CACHE mode; I get errors if I use partial or full cache mode, I assume because the [ORACLE Source] is always changing. The [Lookup] then connects to a [Conditional Split], where I use an expression like the one below.

(REPLACENULL(ORACLE.ID,"") != REPLACENULL(Lookup.ID,""))
|| (REPLACENULL(ORACLE.CASE_NUMBER,"")
!= REPLACENULL(Lookup.CASE_NUMBER,""))

I then send the rows that the [Conditional Split] outputs into a staging table. Finally I add an [Execute SQL Task] and perform an UPDATE against the DESTINATION-TABLE with the query below:

 UPDATE SD
 SET SD.CASE_NUMBER = UP.CASE_NUMBER,
     SD.ID          = UP.ID
 FROM Destination SD
 JOIN STAGING.TABLE UP
    ON UP.ID = SD.ID

Problem: This becomes very slow, takes a very long time, and just keeps running. How can I improve the run time and get it to work? Should I use a cache transformation? Should I use a MERGE statement instead?

How would I use the REPLACENULL expression in the conditional split when the column is a date column? Would I use something like:

(REPLACENULL(ORACLE.LAST_MODIFIED_DATE,"01-01-1900 00:00:00.000")
 != REPLACENULL(Lookup.LAST_MODIFIED_DATE,"01-01-1900 00:00:00.000"))

PICTURES BELOW: [Data Flow screenshot], [Control Flow screenshot]

asked Mar 13 '23 by LTLTLT

2 Answers

A pattern that is usually faster for larger datasets is to load the source data into a local staging table, then use a query like the one below to identify the new records:

SELECT column1, column2
FROM StagingTable SRC
WHERE NOT EXISTS (
    SELECT * FROM TargetTable TGT
    WHERE TGT.MatchKey = SRC.MatchKey
)

Then you just feed that dataset into an insert:

INSERT INTO TargetTable (column1, column2)
SELECT column1, column2
FROM StagingTable SRC
WHERE NOT EXISTS (
    SELECT * FROM TargetTable TGT
    WHERE TGT.MatchKey = SRC.MatchKey
)

Updates look like this:

UPDATE TGT
SET
    column1 = SRC.column1,
    column2 = SRC.column2,
    DTUpdated = GETDATE()
FROM TargetTable TGT
JOIN StagingTable SRC
    ON TGT.MatchKey = SRC.MatchKey

Note the additional column DTUpdated. You should always have a 'last updated' column in your table to help with auditing and debugging.
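If the target doesn't already have such a column, adding it is a one-off change along these lines (table, column, and constraint names here are just placeholders):

-- Add an audit column with a default so existing load code keeps working
ALTER TABLE TargetTable
ADD DTUpdated DATETIME NOT NULL
    CONSTRAINT DF_TargetTable_DTUpdated DEFAULT (GETDATE());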

This is an INSERT/UPDATE approach. There are other data load approaches, such as windowing (pick a trailing window of data to be fully deleted and reloaded), but the right approach depends on how your system works and whether you can make assumptions about the data (e.g. posted data in the source will never be changed).
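As a rough sketch of the windowing idea (the seven-day window and the table/column names are assumptions, not anything from your setup):

-- Reload a trailing window: delete the last N days from the target,
-- then reinsert that same window from staging
DECLARE @WindowStart DATETIME = DATEADD(DAY, -7, CAST(GETDATE() AS DATE));

DELETE FROM TargetTable
WHERE LastModifiedDate >= @WindowStart;

INSERT INTO TargetTable (column1, column2, LastModifiedDate)
SELECT column1, column2, LastModifiedDate
FROM StagingTable
WHERE LastModifiedDate >= @WindowStart;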

You can squash the separate INSERT and UPDATE statements into a single MERGE statement, although it gets pretty huge, I've had performance issues with it, and there are other documented issues with MERGE.
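For completeness, a minimal sketch of what that MERGE might look like, using the same illustrative names as above:

-- Single statement that updates matched rows and inserts new ones
MERGE TargetTable AS TGT
USING StagingTable AS SRC
    ON TGT.MatchKey = SRC.MatchKey
WHEN MATCHED THEN
    UPDATE SET
        TGT.column1   = SRC.column1,
        TGT.column2   = SRC.column2,
        TGT.DTUpdated = GETDATE()
WHEN NOT MATCHED BY TARGET THEN
    INSERT (column1, column2, DTUpdated)
    VALUES (SRC.column1, SRC.column2, GETDATE());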

answered Mar 15 '23 by Nick.McDermaid


Unfortunately, there's not a good way to do what you're trying to do. SSIS has some controls and documented ways to do this, but as you have found, they don't work as well once you start dealing with large amounts of data.

At a previous job, we had something similar that we needed to do. We needed to update medical claims from a source system to another system, similar to your setup. For a very long time, we just truncated everything in the destination and rebuilt every night. I think we were doing this daily with more than 25M rows. If you're able to transfer all the rows from Oracle to SQL in a decent amount of time, then truncating and reloading may be an option.
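In case it's useful, the nightly truncate-and-reload boils down to something like this (table names are placeholders; the staging table is whatever your Oracle-to-SQL data flow lands into):

-- Wipe the destination and rebuild it from the freshly staged data
TRUNCATE TABLE dbo.TargetTable;

INSERT INTO dbo.TargetTable (column1, column2)
SELECT column1, column2
FROM dbo.StagingTable;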

We eventually had to get away from this as our volumes grew, however. We tried to do something along the lines of what you're attempting, but never got anything we were satisfied with. We ended up with a sort of non-conventional process. First, each medical claim had a unique numeric identifier. Second, whenever the medical claim was updated in the source system, there was an incremental ID on the individual claim that was also incremented.

Step one of our process was to bring over any new medical claims, or claims that had changed. We could determine this quite easily, since the unique ID and the "change ID" column were both indexed in source and destination. These records were inserted directly into the destination table.

The second step was our "deletes", which we handled with a logical flag on the records. For actual deletes, where records existed in the destination but were no longer in the source, I believe it was actually fastest to select the DISTINCT claim numbers from the source system into a temporary table on the SQL side, then do a LEFT JOIN update to set the missing claims to logically deleted. We did something similar with our updates: if a newer version of the claim was brought over by our original lookup, we would logically delete the old one.

Every so often we would clean up the logical deletes and actually delete them, but since the logical-delete indicator was indexed, this didn't need to be done too frequently. We never saw much of a performance hit, even when the logically deleted records numbered in the tens of millions.
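A rough sketch of that delete-flagging step, with hypothetical table and column names (ClaimID, IsDeleted, and the table names are assumptions, not our actual schema):

-- 1. Stage the distinct claim IDs that still exist in the source system
CREATE TABLE #SourceClaimIDs (ClaimID BIGINT NOT NULL PRIMARY KEY);

INSERT INTO #SourceClaimIDs (ClaimID)
SELECT DISTINCT ClaimID
FROM dbo.SourceStaging;   -- wherever the source claim numbers were landed

-- 2. Logically delete destination rows whose claim no longer exists in the source
UPDATE TGT
SET TGT.IsDeleted = 1
FROM dbo.DestinationClaims TGT
LEFT JOIN #SourceClaimIDs SRC
    ON SRC.ClaimID = TGT.ClaimID
WHERE SRC.ClaimID IS NULL
  AND TGT.IsDeleted = 0;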

This process was always evolving as our server loads and data source volumes changed, and I suspect the same may be true for your process. Because every system and setup is different, some of the things that worked well for us may not work for you, and vice versa. I know our data center was relatively good and we were on some stupid fast flash storage, so truncating and reloading worked for us for a very, very long time. This may not be true on conventional storage, where your data interconnects are not as fast, or where your servers are not colocated.

When designing your process, keep in mind that deletes are one of the more expensive operations you can perform, followed by updates and then non-bulk inserts.

answered Mar 15 '23 by Patrick Tucci