Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to load grouped data with SSIS

I have a tricky flat file data source. The data is grouped, like this:

Country    City
U.S.       New York
           Washington
           Baltimore
Canada     Toronto
           Vancouver

But I want it to be this format when it's loaded in to the database:

Country    City
U.S.       New York
U.S.       Washington
U.S.       Baltimore
Canada     Toronto
Canada     Vancouver

Anyone has met such a problem before? Got a idea to deal with it?
The only idea I got now is to use the cursor, but the it is just too slow.
Thank you!

like image 644
William Xu Avatar asked Apr 13 '16 02:04

William Xu


People also ask

How is fuzzy grouping used in SSIS?

The Fuzzy Grouping transformation includes two features for customizing the grouping it performs: token delimiters and similarity threshold. The transformation provides a default set of delimiters used to tokenize the data, but you can add new delimiters that improve the tokenization of your data.

What is Fuzzy Lookup in SSIS?

The Fuzzy Lookup transformation performs data cleaning tasks such as standardizing data, correcting data, and providing missing values.


2 Answers

The answer by cha will work, but here is another in case you need to do it in SSIS without temporary/staging tables:

You can run your dataflow through a Script Transformation that uses a DataFlow-level variable. As each row comes in the script checks the value of the Country column.

If it has a non-blank value, then populate the variable with that value, and pass it along in the dataflow.

If Country has a blank value, then overwrite it with the value of the variable, which will be last non-blank Country value you got.

EDIT: I looked up your error message and learned something new about Script Components (the Data Flow tool, as opposed to Script Tasks, the Control Flow tool):

The collection of ReadWriteVariables is only available in the PostExecute method to maximize performance and minimize the risk of locking conflicts. Therefore you cannot directly increment the value of a package variable as you process each row of data. Increment the value of a local variable instead, and set the value of the package variable to the value of the local variable in the PostExecute method after all data has been processed. You can also use the VariableDispenser property to work around this limitation, as described later in this topic. However, writing directly to a package variable as each row is processed will negatively impact performance and increase the risk of locking conflicts.

That comes from this MSDN article, which also has more information about the Variable Dispenser work-around, if you want to go that route, but apparently I mislead you above when I said you can set the value of the package variable in the script. You have to use a variable that is local to the script, and then change it in the Post-Execute event handler. I can't tell from the article whether that means that you will not be able to read the variable in the script, and if that's the case, then the Variable Dispenser would be the only option. Or I suppose you could create another variable that the script will have read-only access to, and set its value to an expression so that it always has the value of the read-write variable. That might work.

like image 97
Tab Alleman Avatar answered Oct 02 '22 05:10

Tab Alleman


Yes, it is possible. First you need to load the data to a table with an IDENTITY column:

-- drop table #t
CREATE TABLE #t (id INTEGER IDENTITY PRIMARY KEY,
Country VARCHAR(20),
City VARCHAR(20))

INSERT INTO #t(Country, City)
SELECT a.Country, a.City
 FROM OPENROWSET( BULK 'c:\import.txt', 
     FORMATFILE = 'c:\format.fmt',
     FIRSTROW = 2) AS a;

select * from #t

The result will be:

id          Country              City
----------- -------------------- --------------------
1           U.S.                 New York
2                                Washington
3                                Baltimore
4           Canada               Toronto
5                                Vancouver

And now with a bit of recursive CTE magic you can populate the missing details:

;WITH a as(
    SELECT Country
          ,City
          ,ID
    FROM #t WHERE ID = 1
    UNION ALL
    SELECT COALESCE(NULLIF(LTrim(#t.Country), ''),a.Country)
          ,#t.City
          ,#t.ID
    FROM a INNER JOIN #t ON a.ID+1 = #t.ID
    )
SELECT * FROM a
 OPTION (MAXRECURSION 0)

Result:

Country              City                 ID
-------------------- -------------------- -----------
U.S.                 New York             1
U.S.                 Washington           2
U.S.                 Baltimore            3
Canada               Toronto              4
Canada               Vancouver            5

Update:

As Tab Alleman suggested below the same result can be achieved without the recursive query:

SELECT ID
     , COALESCE(NULLIF(LTrim(a.Country), ''), (SELECT TOP 1 Country FROM #t t WHERE t.ID < a.ID AND LTrim(t.Country) <> '' ORDER BY t.ID DESC))
     , City
FROM #t a

BTW, the format file for your input data is this (if you want to try the scripts save the input data as c:\import.txt and the format file below as c:\format.fmt):

9.0
  2
  1       SQLCHAR       0       11      ""       1     Country      SQL_Latin1_General_CP1_CI_AS
  2       SQLCHAR       0       100     "\r\n"   2     City         SQL_Latin1_General_CP1_CI_AS
like image 35
cha Avatar answered Oct 02 '22 04:10

cha