Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Loading Multiple Tables using SSIS keeping foreign key relationships

I am trying to load data from a single file (with million+ records) into multiple tables on SQL Server using SSIS while maintaining the relationships defined in the file.

To better elaborate with an example, lets assume I am trying to load a file containing employee name, the offices they have occupied in the past and their Job title history separated by a tab.

File:

EmployeeName<tab>OfficeHistory<tab>JobLevelHistory
John Smith<tab>501<tab>Engineer
John Smith<tab>601<tab>Senior Engineer
John Smith<tab>701<tab>Manager
Alex Button<tab>601<tab>Senior Assistant
Alex Button<tab>454<tab>Manager

If my Office database schema has the following tables:

Employee (nId, name)
Office (nId, number)
JobTitle (nId, titleName)
Employee2Office (nEmpID, nOfficeId)
Employee2JobTitle (nEmpId, nJobTitleID)

How can I use SSIS to load the file into the schema above Auto-Generating IDs for Employee, Office and JobTitle and maintaining the relationship between the employee and offices, and employee and Job Titles?

So in this case. the tables should look like:

Employee
1 John Smith
2 Alex Button

Office
1 501
2 601
3 701
4 454

JobTitle
1 Engineer
2 Senior Engineer
3 Manager
4 Senior Assistant

Employee2Office
1 1
1 2
1 3
2 2
2 4

Employee2JobTitle
1 1
1 2
1 3
2 4
2 3

I am a newbie to SSIS and have not played around with auto-generating IDs and establishing foreign key relationships while executing a Data Flow Task. Any pointers would be appreciated.

Thanks!

like image 498
mvm Avatar asked Feb 06 '10 03:02

mvm


People also ask

Can you relate data to multiple tables using a foreign key?

Foreign keys link data in one table to the data in another table. A foreign key column in a table points to a column with unique values in another table (often the primary key column) to create a way of cross-referencing the two tables.

Is it possible to link two tables using two foreign keys?

If there is exactly one foreign key with the same name as a table in the join, SQL Anywhere uses it to generate the join condition. If there is more than one foreign key with the same name as a table, the join is ambiguous and an error is issued.

How do I create a foreign key between two tables?

To create a new table containing a foreign key column that references another table, use the keyword FOREIGN KEY REFERENCES at the end of the definition of that column. Follow that with the name of the referenced table and the name of the referenced column in parentheses.


2 Answers

An interesting question. Here is how I would do it (Sql Server 2005). (I'm assuming this is a monthly job and not just a one time, so I added code for repeatability.)

  1. Create three variables for the Employee, JobTitle and Office table (type = Object)
  2. Use three sql tasks to select the rows for of those three tables into the corresponding variables.
  3. Add a dataflow task.
  4. Select from your flat file using a flat file destination.
  5. Output goes into a script component with the three columns in the flat file as input, the three table variable imported into the script, five outputs in the script component each with the same exclusion group number and the input marked as synchronous for that output, seven new columns (3 for emp one for each output it will be in, 2 for job, 2 for office) added to the output, and with the following code (A Reference to System.xml.dll had to be added to make this all work.):

    Imports System
    Imports System.Data
    Imports System.Math
    Imports Microsoft.SqlServer.Dts.Pipeline.Wrapper
    Imports Microsoft.SqlServer.Dts.Runtime.Wrapper
    Imports System.Collections
    Imports System.Data.OleDb
    
    Public Class ScriptMain
        Inherits UserComponent
    
        Private da As New OleDbDataAdapter
        Private emp As New DataTable
        Private emph As New Hashtable()
        Private job As New DataTable
        Private jobh As New Hashtable()
        Private off As New DataTable
        Private offh As New Hashtable()
        Private maxempid As Integer
        Private maxjobid As Integer
        Private maxoffid As Integer
    
        Public Overrides Sub PreExecute()
            maxempid = 0
            maxjobid = 0
            maxoffid = 0
            da.Fill(emp, Me.Variables.EmpTab)
            For Each dr As DataRow In emp.Rows
                emph.Add(dr.Item("Name"), dr.Item("nID"))
                If (CInt(dr.Item("nID").ToString) > maxempid) Then
                    maxempid = CInt(dr.Item("nID").ToString)
                End If
            Next
            da.Fill(job, Me.Variables.JobTab)
            For Each dr As DataRow In job.Rows
                jobh.Add(dr.Item("titleName"), dr.Item("nID"))
                If (CInt(dr.Item("nID").ToString) > maxempid) Then
                    maxjobid = CInt(dr.Item("nID").ToString)
                End If
            Next
            da.Fill(off, Me.Variables.OffTab)
            For Each dr As DataRow In off.Rows
                offh.Add(dr.Item("number"), dr.Item("nID"))
                If (CInt(dr.Item("nID").ToString) > maxempid) Then
                    maxoffid = CInt(dr.Item("nID").ToString)
                End If
            Next
            emp.Dispose()
            job.Dispose()
            off.Dispose()
            da.Dispose()
            MyBase.PreExecute()
        End Sub
    
    Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)
    
        If Not emph.ContainsKey(Row.EmployeeName) Then
            maxempid += 1
            emph.Add(Row.EmployeeName, maxempid)
            Row.EmpId = maxempid
            Row.Emp2Id = maxempid
            Row.Emp3Id = maxempid
            Row.DirectRowToEmployee()
        Else
            Row.EmpId = CInt(emph.Item(Row.EmployeeName).ToString)
            Row.Emp2Id = CInt(emph.Item(Row.EmployeeName).ToString)
            Row.Emp3Id = CInt(emph.Item(Row.EmployeeName).ToString)
        End If
        If Not jobh.ContainsKey(Row.JobLevelHistory) Then
            maxjobid += 1
            jobh.Add(Row.JobLevelHistory, maxjobid)
            Row.JobId = maxjobid
            Row.Job2Id = maxjobid
            Row.DirectRowToJobTitle()
        Else
            Row.JobId = CInt(jobh.Item(Row.JobLevelHistory).ToString)
            Row.Job2Id = CInt(jobh.Item(Row.JobLevelHistory).ToString)
        End If
        If Not offh.ContainsKey(Row.OfficeHistory) Then
            maxoffid += 1
            offh.Add(Row.OfficeHistory, maxoffid)
            Row.OffId = maxoffid
            Row.Off2Id = maxoffid
            Row.DirectRowToOfficeNumber()
        Else
            Row.OffId = CInt(offh.Item(Row.OfficeHistory).ToString)
            Row.Off2Id = CInt(offh.Item(Row.OfficeHistory).ToString)
        End If
        Row.DirectRowToEmp2Job()
        Row.DirectRowToEmp2Off()
    End Sub        
    End Class
    
  6. The results from this script (The script generates id's for new values in the input data. It does this by loading up the existing table into hashtables in the preexecute portion of the script, then by checking for existance of the name and based upon that either increments the maxid and adds it to the hash if it adds to the hash, it also adds the row to the appropriate (emp, job, or off) output, or retrieves the maxid from the hash for each row.) all row regardless of status above will be written to the two remaining outputs (emp2job and emp2off).

  7. Next, send dataflows to lookup (to check for existing rows in the destination table, then the destination oledb connectors (the emp, job, and off check the box for identity inserts, the emp2job and emp2off uncheck check constraints).
like image 99
William Salzman Avatar answered Oct 18 '22 03:10

William Salzman


If you are sure that your referential integrity is OK with the data that you want to load, you can disable foreign key constraints in a Script Task, then execute data flow with parallel data load and after data load finishes, enable the constraints again. If there is something not right with the data, the operation will fail. You will have to design rollback or cleanup strategy though.

The other option is just to load data in serial way, starting from main tables and finishing on child tables. I think this is 'safer' option, as it doesn't expose your data integrity to other users who may be using these tables at the time of ETL load. I would prefer this option.

like image 33
Piotr Rodak Avatar answered Oct 18 '22 01:10

Piotr Rodak