Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Multicore Text File Parsing

I have a quad core machine and would like to write some code to parse a text file that takes advantage of all four cores. The text file basically contains one record per line.

Multithreading isn't my forte so I'm wondering if anyone could give me some patterns that I might be able to use to parse the file in an optimal manner.

My first thoughts are to read all the lines into some sort of queue and then spin up threads to pull the lines off the queue and process them, but that means the queue would have to exist in memory and these are fairly large files so I'm not so keen on that idea.

My next thoughts are to have some sort of controller that will read in a line and assign it a thread to parse, but I'm not sure if the controller will end up being a bottleneck if the threads are processing the lines faster than it can read and assign them.

I know there's probably another simpler solution than both of these but at the moment I'm just not seeing it.

like image 238
lomaxx Avatar asked Aug 10 '08 03:08

lomaxx


5 Answers

I'd go with your original idea. If you are concerned that the queue might get too large implement a buffer-zone for it (i.e. If is gets above 100 lines the stop reading the file and if it gets below 20 then start reading again. You'd need to do some testing to find the optimal barriers). Make it so that any of the threads can potentially be the "reader thread" as it has to lock the queue to pull an item out anyway it can also check to see if the "low buffer region" has been hit and start reading again. While it's doing this the other threads can read out the rest of the queue.

Or if you prefer, have one reader thread assign the lines to three other processor threads (via their own queues) and implement a work-stealing strategy. I've never done this so I don't know how hard it is.

like image 91
Mike Minutillo Avatar answered Nov 17 '22 00:11

Mike Minutillo


Mark's answer is the simpler, more elegant solution. Why build a complex program with inter-thread communication if it's not necessary? Spawn 4 threads. Each thread calculates size-of-file/4 to determine it's start point (and stop point). Each thread can then work entirely independently.

The only reason to add a special thread to handle reading is if you expect some lines to take a very long time to process and you expect that these lines are clustered in a single part of the file. Adding inter-thread communication when you don't need it is a very bad idea. You greatly increase the chance of introducing an unexpected bottleneck and/or synchronization bugs.

like image 41
Derek Park Avatar answered Nov 17 '22 00:11

Derek Park


This will eliminate bottlenecks of having a single thread do the reading:

open file
for each thread n=0,1,2,3:
    seek to file offset 1/n*filesize
    scan to next complete line
    process all lines in your part of the file
like image 44
Mark Harrison Avatar answered Nov 17 '22 00:11

Mark Harrison


My experience is with Java, not C#, so apologies if these solutions don't apply.

The immediate solution I can think up off the top of my head would be to have an executor that runs 3 threads (using Executors.newFixedThreadPool, say). For each line/record read from the input file, fire off a job at the executor (using ExecutorService.submit). The executor will queue requests for you, and allocate between the 3 threads.

Probably better solutions exist, but hopefully that will do the job. :-)

ETA: Sounds a lot like Wolfbyte's second solution. :-)

ETA2: System.Threading.ThreadPool sounds like a very similar idea in .NET. I've never used it, but it may be worth your while!

like image 1
Chris Jester-Young Avatar answered Nov 17 '22 00:11

Chris Jester-Young


Since the bottleneck will generally be in the processing and not the reading when dealing with files I'd go with the producer-consumer pattern. To avoid locking I'd look at lock free lists. Since you are using C# you can take a look at Julian Bucknall's Lock-Free List code.

like image 1
graham.reeds Avatar answered Nov 17 '22 01:11

graham.reeds