Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Options to use multithreading to process a group of database records?

I have a database table that contains some records to be processed. The table has a flag column that represents the following status values. 1 - ready to be processed, 2- successfully processed, 3- processing failed.

The .net code (repeating process - console/service) will grab a list of records that are ready to be processed, and loop through them and attempt to process them (Not very lengthy), update status based on success or failure.

To have better performance, I want to enable multithreading for this process. I'm thinking to spawn say 6 threads, each threads grabbing a subset.

Obviously I want to avoid having different threads process the same records. I dont want to have a "Being processed" flag in the database to handle the case where the thread crashes leaving the record hanging.

The only way I see doing this is to grab the complete list of available records and assigning a group (maybe ids) to each thread. If an individual thread fails, its unprocessed records will be picked up next time the process runs.

Is there any other alternatives to dividing the groups prior to assigning them to threads?

like image 744
Alex J Avatar asked Jun 06 '12 19:06

Alex J


2 Answers

The most straightforward way to implement this requirement is to use the Task Parallel Library's

Parallel.ForEach (or Parallel.For).

Allow it to manage individual worker threads.

From experience, I would recommend the following:

  • Have an additional status "Processing"
  • Have a column in the database that indicates when a record was picked up for processing and a cleanup task / process that runs periodically looking for records that have been "Processing" for far too long (reset the status to "ready for processing).
  • Even though you don't want it, "being processed" will be essential to crash recovery scenarios (unless you can tolerate the same record being processed twice).

Alternatively

Consider using a transactional queue (MSMQ or Rabbit MQ come to mind). They are optimized for this very problem.

That would be my clear choice, having done both at massive scale.

Optimizing

If it takes a non-trivial amount of time to retrieve data from the database, you can consider a Producer/Consumer pattern, which is quite straightforward to implement with a BlockingCollection. That pattern allows one thread (producer) to populate a queue with DB records to be processed, and multiple other threads (consumers) to process items off of that queue.

A New Alternative

Given that several processing steps touch the record before it is considered complete, have a look at Windows Workflow Foundation as a possible alternative.

like image 187
Eric J. Avatar answered Oct 24 '22 21:10

Eric J.


I remember doing something like what you described...A thread checks from time to time if there is something new in database that needs to be processed. It will load only the new ids, so if at time x last id read is 1000, at x+1 will read from id 1001.

Everything it reads goes into a thread safe Queue. When items are added to this queue, you notify the working threads (maybe use autoreset events, or spawn threads here). each thread will read from this thread safe queue one item at a time, until the queue is emptied.

You should not assign before the work foreach thread (unless you know that foreach file the process takes the same amount of time). if a thread finishes the work, then it should take the load from the other ones left. using this thread safe queue, you make sure of this.

like image 25
Andrei Neagu Avatar answered Oct 24 '22 21:10

Andrei Neagu