
Read large txt file multithreaded?

Tags:

c#

I have a large txt file with 100,000 lines. I need to start n threads and give every thread a unique line from this file.

What is the best way to do this? I think I need to read the file line by line, and the iterator must be global so I can lock it. Loading the text file into a list would be time-consuming, and I could get an OutOfMemoryException. Any ideas?

asked Jun 19 '13 by obdgy


3 Answers

You can use the File.ReadLines Method to read the file line-by-line without loading the whole file into memory at once, and the Parallel.ForEach Method to process the lines in multiple threads in parallel:

Parallel.ForEach(File.ReadLines("file.txt"), (line, _, lineNumber) =>
{
    // your code here
});
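
Since the question asks for an exact number of threads (n), note that Parallel.ForEach also has overloads that accept a ParallelOptions instance, which lets you cap the degree of parallelism. A minimal sketch, assuming a placeholder file name and a limit of 4:

using System.IO;
using System.Threading.Tasks;

var options = new ParallelOptions { MaxDegreeOfParallelism = 4 }; // assumption: 4 stands in for your n

Parallel.ForEach(File.ReadLines("file.txt"), options, (line, _, lineNumber) =>
{
    // lineNumber is the zero-based position of the line in the enumeration
    // your code here
});

Keep in mind that MaxDegreeOfParallelism limits the number of concurrent tasks rather than guaranteeing exactly n dedicated threads.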
answered Nov 15 '22 by dtb


Read the file on one thread, adding its lines to a blocking queue. Start N tasks that read from that queue. Set a maximum size on the queue to prevent out-of-memory errors.
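
For reference, here is a minimal sketch of that idea using BlockingCollection<T> with a bounded capacity; the file name, the capacity of 1000, and the consumer count of 4 are placeholder values:

using System.Collections.Concurrent;
using System.IO;
using System.Linq;
using System.Threading.Tasks;

// Bounded capacity makes Add block once the queue is full,
// so the reader cannot run far ahead of the consumers.
var queue = new BlockingCollection<string>(boundedCapacity: 1000);

var producer = Task.Run(() =>
{
    foreach (var line in File.ReadLines("file.txt"))
        queue.Add(line);

    queue.CompleteAdding(); // tell the consumers no more lines are coming
});

int n = 4; // placeholder for the desired number of consumer tasks
var consumers = Enumerable.Range(0, n)
    .Select(_ => Task.Run(() =>
    {
        // GetConsumingEnumerable hands each line to exactly one consumer.
        foreach (var line in queue.GetConsumingEnumerable())
        {
            // process the line here
        }
    }))
    .ToArray();

Task.WaitAll(new[] { producer }.Concat(consumers).ToArray());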

answered Nov 15 '22 by Sergey Kalinichenko


After running my own benchmarks for loading 61,277,203 lines into memory and shoving the values into a Dictionary / ConcurrentDictionary, the results seem to support @dtb's answer above that the following approach is the fastest:

Parallel.ForEach(File.ReadLines(catalogPath), line =>
{
    // process each line here
});

My tests also showed the following:

  1. File.ReadAllLines() and File.ReadAllLines().AsParallel() appear to run at almost exactly the same speed on a file of this size. Judging by my CPU activity, both appear to use only two of my 8 cores.
  2. Reading all the data first using File.ReadAllLines() appears to be much slower than using File.ReadLines() in a Parallel.ForEach() loop.
  3. I also tried a producer/consumer (MapReduce-style) pattern, where one thread reads the data and a second thread processes it. This did not outperform the simple pattern above either.

I have included an example of this pattern for reference, since it is not included on this page:

var inputLines = new BlockingCollection<string>();
ConcurrentDictionary<int, int> catalog = new ConcurrentDictionary<int, int>();

var readLines = Task.Factory.StartNew(() =>
{
    // Producer: stream the file into the collection one line at a time.
    foreach (var line in File.ReadLines(catalogPath))
        inputLines.Add(line);

    // Signal consumers that no more lines will be added.
    inputLines.CompleteAdding();
});

var processLines = Task.Factory.StartNew(() =>
{
    // Consumer: process lines in parallel as they become available.
    Parallel.ForEach(inputLines.GetConsumingEnumerable(), line =>
    {
        string[] lineFields = line.Split('\t');
        int genomicId = int.Parse(lineFields[3]);
        int taxId = int.Parse(lineFields[0]);
        catalog.TryAdd(genomicId, taxId);
    });
});

Task.WaitAll(readLines, processLines);

Here are my benchmarks:

[benchmark results chart]

I suspect that under certain processing conditions, the producer / consumer pattern might outperform the simple Parallel.ForEach(File.ReadLines()) pattern. However, it did not in this situation.

answered Nov 15 '22 by Jake Drew