I'm trying to create an SSIS package to process files from a directory that contains many years worth of files. The files are all named numerically, so to save processing everything, I want to pass SSIS a minimum number, and only enumerate files whose name (converted to a number) is higher than my minimum.
I've tried letting the ForEach File loop enumerate everything and then exclude files in a Script Task, but when dealing with hundreds of thousands of files, this is way too slow to be suitable.
The FileSpec property lets you specify a file mask to dictate which files you want in the collection, but I can't quite see how to specify an expression to make that work, as it's essentially a string match.
If there's an expression within the component somewhere which basically says Should I Enumerate? - Yes / No
, that would be perfect. I've been experimenting with the below expression, but can't find a property to which to apply it.
(DT_I4)REPLACE( SUBSTRING(@[User::ActiveFilePath],FINDSTRING( @[User::ActiveFilePath], "\", 7 ) + 1 ,100),".txt","") > @[User::MinIndexId] ? "True" : "False"
The variable must be of Object data type. Specify the connection to a database and the schema to enumerate.
FileSpec - This variable of String data type will hold the file pattern to loop through. Set the expression of this variable to below mentioned value. This expression will use the extension specified on the parameters. If there are no extensions, it will *. * to loop through all files.
The Foreach ADO enumerator lets you access rows of data in a variable configured with the Object data type. So once I select the enumerator type, I select the JobTitles variable from the ADO object source variable drop-down list.
In the SSIS Toolbox, expand Containers, and then drag a Foreach Loop Container onto the design surface of the Control Flow tab. Right-click the new Foreach Loop Container and select Edit. In the Foreach Loop Editor dialog, on the General page, for Name, enter Foreach File in Folder. Select OK.
From investigating how the ForEach loop works in SSIS (with a view to creating my own to solve the issue) it seems that the way it works (as far as I could see anyway) is to enumerate the file collection first, before any mask is specified. It's hard to tell exactly what's going on without seeing the underlying code for the ForEach loop but it seems to be doing it this way, resulting in slow performance when dealing with over 100k files.
While @Siva's solution is fantastically detailed and definitely an improvement over my initial approach, it is essentially just the same process, except using an Expression Task to test the filename, rather than a Script Task (this does seem to offer some improvement).
So, I decided to take a totally different approach and rather than use a file-based ForEach loop, enumerate the collection myself in a Script Task, apply my filtering logic, and then iterate over the remaining results. This is what I did:
In my Script Task, I use the asynchronous DirectoryInfo.EnumerateFiles
method, which is the recommended approach for large file collections, as it allows streaming, rather than having to wait for the entire collection to be created before applying any logic.
Here's the code:
public void Main()
{
string sourceDir = Dts.Variables["SourceDirectory"].Value.ToString();
int minJobId = (int)Dts.Variables["MinIndexId"].Value;
//Enumerate file collection (using Enumerate Files to allow us to start processing immediately
List<string> activeFiles = new List<string>();
System.Threading.Tasks.Task listTask = System.Threading.Tasks.Task.Factory.StartNew(() =>
{
DirectoryInfo dir = new DirectoryInfo(sourceDir);
foreach (FileInfo f in dir.EnumerateFiles("*.txt"))
{
FileInfo file = f;
string filePath = file.FullName;
string fileName = filePath.Substring(filePath.LastIndexOf("\\") + 1);
int jobId = Convert.ToInt32(fileName.Substring(0, fileName.IndexOf(".txt")));
if (jobId > minJobId)
activeFiles.Add(filePath);
}
});
//Wait here for completion
System.Threading.Tasks.Task.WaitAll(new System.Threading.Tasks.Task[] { listTask });
Dts.Variables["ActiveFilenames"].Value = activeFiles;
Dts.TaskResult = (int)ScriptResults.Success;
}
So, I enumerate the collection, applying my logic as files are discovered and immediately adding the file path to my list for output. Once complete, I then assign this to an SSIS Object variable named ActiveFilenames which I'll use as the collection for my ForEach loop.
I configured the ForEach loop as a ForEach From Variable Enumerator, which now iterates over a much smaller collection (Post-filtered List<string>
compared to what I can only assume was an unfiltered List<FileInfo>
or something similar in SSIS' built-in ForEach File Enumerator.
So the tasks inside my loop can just be dedicated to processing the data, since it has already been filtered before hitting the loop. Although it doesn't seem to be doing much different to either my initial package or Siva's example, in production (for this particular case, anyway) it seems like filtering the collection and enumerating asynchronously provides a massive boost over using the built in ForEach File Enumerator.
I'm going to continue investigating the ForEach loop container and see if I can replicate this logic in a custom component. If I get this working I'll post a link in the comments.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With