Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to merge efficiently gigantic files with C#

I have over 125 TSV files of ~100Mb each that I want to merge. The merge operation is allowed destroy the 125 files, but not the data. What matter is that a the end, I end up with a big file of the content of all the files one after the other (no specific order).

Is there an efficient way to do that? I was wondering if Windows provides an API to simply make a big "Union" of all those files? Otherwise, I will have to read all the files and write a big one.

Thanks!

like image 888
Martin Avatar asked Aug 24 '10 13:08

Martin


2 Answers

So "merging" is really just writing the files one after the other? That's pretty straightforward - just open one output stream, and then repeatedly open an input stream, copy the data, close. For example:

static void ConcatenateFiles(string outputFile, params string[] inputFiles)
{
    using (Stream output = File.OpenWrite(outputFile))
    {
        foreach (string inputFile in inputFiles)
        {
            using (Stream input = File.OpenRead(inputFile))
            {
                input.CopyTo(output);
            }
        }
    }
}

That's using the Stream.CopyTo method which is new in .NET 4. If you're not using .NET 4, another helper method would come in handy:

private static void CopyStream(Stream input, Stream output)
{
    byte[] buffer = new byte[8192];
    int bytesRead;
    while ((bytesRead = input.Read(buffer, 0, buffer.Length)) > 0)
    {
        output.Write(buffer, 0, bytesRead);
    }
}

There's nothing that I'm aware of that is more efficient than this... but importantly, this won't take up much memory on your system at all. It's not like it's repeatedly reading the whole file into memory then writing it all out again.

EDIT: As pointed out in the comments, there are ways you can fiddle with file options to potentially make it slightly more efficient in terms of what the file system does with the data. But fundamentally you're going to be reading the data and writing it, a buffer at a time, either way.

like image 125
Jon Skeet Avatar answered Oct 01 '22 01:10

Jon Skeet


Do it from the command line:

copy 1.txt+2.txt+3.txt combined.txt

or

copy *.txt combined.txt
like image 36
Gabriel Magana Avatar answered Oct 01 '22 02:10

Gabriel Magana