I have a task to load a strangely formatted text file. The file also contains unwanted data. It contains two headers back to back, and the data for each header appears on alternate lines. Header rows start after a line of dashes (------).
I need to read each header along with its corresponding data and dump it into an Excel/table destination. Let me know how to solve this with any transformation in SSIS, or maybe with a script; I don't know how to use a script task for this.
Right now I am reading the file into a single column and manually trying to split it in a Derived Column using the SUBSTRING function. But that works for only one header and is too hard-coded. I need a dynamic approach that reads the header rows as well as the data rows directly.
Input file:
A1234-012 I N F O R M A T I C S C O M P A N Y 08/23/17
PAGE 2 BATCH ABC PAYMENT DATE & DUE DATE EDIT PAGE 481
------------------------------------------------------------------------------------------------------------------------------------
SEO XRAT CLT LOAN OPENING PAYMENT MATURIUH LOAN NEXE ORIG-AMT OFF TO CATE CONTC MON NO.TO TOL NEL S CUP CO IND PAT
NOM CODE NOM NOMTER DATE DUO DATE DATE TIME PT # MONEY AQ LOAN NUMBER BLOCK PAYMENT U TYP GH OMG IND
1-3 4-6 7-13/90-102 14-19 20-25 26-31 32-34 35-37 38-46 47-48 49 50-51 52-61 62 63 64-72 73 4-5 76 77 8-80
------------------------------------------------------------------------------------------------------------------------------------
SEO XRAT CLT LOAN A/C A/C MIN MAX MAX PENDI LATE CCH L/F PARTLYS CUR L/F L/F L/F
NOM CODE NOM NOMTER CODE FACTOR MON MON ROAD DAYS MONE POT L/A L/F JAC INT VAD CD USED PI VAD DT
1-3 4-6 7-13/90-102 14 15 20-23 24-29 30-34 35-37 38-42 43 44 49 60 61-63 64-69
USED-ID:
------------------------------------------------------------------------------------------------------------------------------------
454542 070 567 2136547895 08-08-18 08-06-18 11-02-18 123 256 62,222 LK 5 55 5463218975 5 3 5,555.22 33 H55
025641 055 123 5144511352 B .55321 2.55 6531.22 H #AS
454542 070 567 2136547895 08-08-18 08-06-18 11-02-18 123 256 62,222 LK 5 55 5463218975 5 3 5,555.22 33 H55
025641 055 123 5144511352 B .55321 2.55 6531.22 H #AS
454542 070 567 2136547895 08-08-18 08-06-18 11-02-18 123 256 62,222 LK 5 55 5463218975 5 3 5,555.22 33 H55
025641 055 123 5144511352 B .55321 2.55 6531.22 H #AS
Expected output should be:
FILE 1:
SEO XRAT CLT LOAN OPENING PAYMENT MATURIUH LOAN NEXE ORIG-AMT OFF TO CATE CONTC MON NO.TO TOL NEL S CUP CO IND PAT
NOM CODE NOM NOMTER DATE DUO DATE DATE TIME PT # MONEY AQ LOAN NUMBER BLOCK PAYMENT U TYP GH OMG IND
454542 070 567 2136547895 08-08-18 08-06-18 11-02-18 123 256 62,222 LK 5 55 5463218975 5 3 5,555.22 33 H55
454542 070 567 2136547895 08-08-18 08-06-18 11-02-18 123 256 62,222 LK 5 55 5463218975 5 3 5,555.22 33 H55
454542 070 567 2136547895 08-08-18 08-06-18 11-02-18 123 256 62,222 LK 5 55 5463218975 5 3 5,555.22 33 H55
FILE 2:
SEO XRAT CLT LOAN A/C A/C MIN MAX MAX PENDI LATE CCH L/F PARTLYS CUR L/F L/F L/F
NOM CODE NOM NOMTER CODE FACTOR MON MON ROAD DAYS MONE POT L/A L/F JAC INT VAD CD USED PI VAD DT
025641 055 123 5144511352 B .55321 2.55 6531.22 H #AS
025641 055 123 5144511352 B .55321 2.55 6531.22 H #AS
025641 055 123 5144511352 B .55321 2.55 6531.22 H #AS
To ignore the first 3 rows, you can simply configure the flat file connection manager to skip them (set Header rows to skip to 3).
1. Configure connection managers
In addition, in the flat file connection manager, go to the Advanced tab, delete all columns except one, and change that column's data type to DT_STR with a length (OutputColumnWidth) of 4000.
Add two more flat file connection managers, one for each destination file, where you define only one column with max length = 4000.
2. Configure the Data Flow Task
Add a Data Flow Task and add a Flat File Source inside it. Select the source file connection manager.
Add a conditional split with the following expressions:
File1
FINDSTRING([Column 0],"OPENING",1) > 1 || FINDSTRING([Column 0],"DATE",1) > 1 || TOKENCOUNT([Column 0]," ") == 19
File2
FINDSTRING([Column 0],"A/C",1) > 1 || FINDSTRING([Column 0],"FACTOR",1) > 1 || TOKENCOUNT([Column 0]," ") == 10
The expressions above are based on the expected output you mentioned in the question: I searched for keywords that are unique to each header and split the data rows by the number of space-separated tokens.
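If you'd rather do the whole split in a script (since you mentioned not knowing how to use one), the same classification logic can be sketched as a small stand-alone C# program. This is only a rough sketch based on the expressions above, not part of the SSIS package: the file name input.txt and the ClassifyLine helper are made up, and it assumes the first 3 banner rows are skipped just like the connection manager does (otherwise the "PAGE ... DATE" row would wrongly match File1):
using System;
using System.IO;
using System.Linq;

class SplitSketch
{
    // Classify a raw line the same way the Conditional Split expressions do.
    static string ClassifyLine(string line)
    {
        int tokens = line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries).Length;

        if (line.Contains("OPENING") || line.Contains("DATE") || tokens == 19)
            return "File1";   // first header pair or 19-token data rows
        if (line.Contains("A/C") || line.Contains("FACTOR") || tokens == 10)
            return "File2";   // second header pair or 10-token data rows

        return "Ignore";      // separators, column-position rows, USED-ID, etc.
    }

    static void Main()
    {
        // Skip the first 3 banner rows, mirroring the "Header rows to skip" setting.
        foreach (string line in File.ReadLines("input.txt").Skip(3))
            Console.WriteLine(ClassifyLine(line) + ": " + line);
    }
}
In SSIS itself the same checks would live inside a Script Component with two outputs rather than a console program.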
Finally, map each output to a Flat File Destination component.
Executing the package splits the rows between the two destination files as expected.
If you only need to remove the duplicate headers, you can do this in two steps: a Script Component followed by a Conditional Split.
In addition, because the column values do not contain spaces, you can use a regular expression to replace the runs of spaces with a single tab so that the output files are consistent.
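As a quick stand-alone illustration of that replacement (the sample line here is made up), a single Regex.Replace call with the same pattern used in the script below collapses the space runs into tabs:
using System;
using System.Text.RegularExpressions;

// Collapse runs of two or more spaces into a single tab; single spaces are left alone.
string raw = "454542 070   567      2136547895";
string normalized = Regex.Replace(raw.TrimStart(), "[ ]{2,}", "\t");
Console.WriteLine(normalized);   // 454542 070<TAB>567<TAB>2136547895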
Script Component
In the Script Component, add an output column of type DT_BOOL and name it outFlag, add another output column outColumn0 of type DT_STR with a length of 4000, and select Column 0 as an input column.
Then write the following script in the Script Editor (C#):
First, make sure that you add the RegularExpressions namespace:
using System.Text.RegularExpressions;
Script Code
// These fields persist across rows, so each header line is kept only once.
int SEOCount = 0;
int NOMCount = 0;

// Matches runs of two or more spaces so they can be collapsed into a single tab.
Regex regex = new Regex("[ ]{2,}", RegexOptions.None);

public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    if (Row.Column0.Trim().StartsWith("SEO"))
    {
        // First header line: keep only its first occurrence.
        if (SEOCount == 0)
        {
            SEOCount++;
            Row.outFlag = true;
        }
        else
        {
            Row.outFlag = false;
        }
    }
    else if (Row.Column0.Trim().StartsWith("NOM"))
    {
        // Second header line: keep only its first occurrence.
        if (NOMCount == 0)
        {
            NOMCount++;
            Row.outFlag = true;
        }
        else
        {
            Row.outFlag = false;
        }
    }
    else if (Row.Column0.Trim().StartsWith("PAGE"))
    {
        // Page banner rows are never needed.
        Row.outFlag = false;
    }
    else
    {
        // Data rows are always kept.
        Row.outFlag = true;
    }

    // Trim leading spaces and collapse the remaining space runs into tabs.
    Row.outColumn0 = regex.Replace(Row.Column0.TrimStart(), "\t");
}
Conditional Split
Add a Conditional Split after each Script Component and use the following expression to filter out the duplicate headers:
[outFlag] == TRUE
Then connect this output to the destination, and make sure to map outColumn0 to the destination column.
Package link