Best way to read multiple very large files

I need help figuring out the fastest way to read through about 80 files, each with over 500,000 lines, and write them to one master file where each input file's line becomes a column in the master. The master file must be plain text, readable in an editor like Notepad rather than a Microsoft Office product, because those can't handle this many rows.

For example, the master file should look something like this:

File1_Row1,File2_Row1,File3_Row1,...

File1_Row2,File2_Row2,File3_Row2,...

File1_Row3,File2_Row3,File3_Row3,...

etc.

I've tried 2 solutions so far:

  1. Create a jagged array that holds each file's contents, and write the master file once every line of every file has been read. The problem with this solution is that Windows throws an out-of-memory error because too much virtual memory is being used (a sketch of this approach follows the list).
  2. Dynamically create a reader thread for each of the 80 files, where each thread reads a specific line number; once all threads have read a line, combine those values, write them to the master file, and repeat for every line in the files. The problem with this solution is that it is very, very slow.
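For illustration, here is a minimal sketch of approach 1 (my reconstruction, not the original code; fileNames is assumed to be a string[] holding the 80 paths):

using System.IO;
using System.Linq;

// Loading every line of every file at once means roughly 80 x 500,000
// strings held in memory simultaneously, which is what exhausts virtual memory.
string[][] allLines = fileNames.Select(File.ReadAllLines).ToArray();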

Does anybody have a better solution for reading so many large files in a fast way?

asked Jan 12 '23 by jmm1487


1 Answer

The best approach is to open a StreamReader for each input file and a single StreamWriter for the output file. Then loop over the readers, reading one line from each and writing it to the master file. This way only one line per file is loaded at a time, so memory pressure stays minimal. I was able to merge 80 files of ~500,000 lines each in 37 seconds. An example:

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;
using System.Linq;

class MainClass
{
    static string[] fileNames = Enumerable.Range(1, 80).Select(i => string.Format("file{0}.txt", i)).ToArray();

    public static void Main(string[] args)
    {
        var stopwatch = Stopwatch.StartNew();
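        // One open reader per input file; each buffers only a small chunk at a time.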
        List<StreamReader> readers = fileNames.Select(f => new StreamReader(f)).ToList();

        try
        {
            using (StreamWriter writer = new StreamWriter("master.txt"))
            {
                string line = null;
                do
                {
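                    // Read one line from each file and write it as the next column.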
                    for (int i = 0; i < readers.Count; i++)
                    {
                        if ((line = readers[i].ReadLine()) != null)
                        {
                            writer.Write(line);
                        }
                        if (i < readers.Count - 1)
                            writer.Write(",");
                    }
                    writer.WriteLine();
                } while (line != null);
            }
        }
        finally
        {
            foreach(var reader in readers)
            {
                reader.Close();
            }
        }
        Console.WriteLine("Elapsed {0} ms", stopwatch.ElapsedMilliseconds);
    }
}

I've assumed that all the input files have the same number of lines, but you should add logic to keep reading as long as at least one file still has data.
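For reference, here is one way that extra logic could look. This is my own sketch, not part of the original answer; it reuses the readers list and writer from the example above and leaves a column blank once its file runs out:

bool anyData;
do
{
    anyData = false;
    var fields = new string[readers.Count];
    for (int i = 0; i < readers.Count; i++)
    {
        // A null result means this file is exhausted; its column stays empty.
        string line = readers[i].ReadLine();
        if (line != null)
        {
            fields[i] = line;
            anyData = true;
        }
    }
    // Emit a row only while at least one file still has data,
    // which also avoids a trailing row of bare commas.
    if (anyData)
        writer.WriteLine(string.Join(",", fields));
} while (anyData);

string.Join renders null entries as empty strings, so the columns stay aligned even when the files differ in length.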

answered Jan 22 '23 by Mike Zboray