 

Heavy Asynchronous Processing

Tags:

java

c#

msmq

I have an application that, in its simplest form, reads a large number of phone numbers from a database (about 15 million) and sends each number, one at a time, to a URL for processing. I designed the application like this:

  1. Bulk export the phone numbers from SQL to a text file using SSIS. This is very quick, a matter of one or two minutes.
  2. Load the numbers into a message queue (I use MSMQ at the moment).
  3. Dequeue the messages from a command-line application, fire off requests over HTTP to some service (about 3 calls per phone number), and finally log to a database.

The problem is that it still takes a long time to complete. MSMQ also has a limit on the size of messages it can hold, so I now have to create multiple message queues. I need a lot of fault tolerance, but I dare not make my message queue transactional because of performance. I'm thinking of publishing the message queue (currently a private queue) to Active Directory so that processes on different systems can dequeue it and finish quicker. Also, my processors hit 100% during execution, so I'm changing the app to use a thread pool. I'm willing to explore JMS if it will handle the queue better. So far, the most efficient part of the whole process is the SSIS step.
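To illustrate the thread-pool change mentioned above, here is a minimal Java sketch of the dequeue-and-call stage throttled by a fixed-size pool. The numbers, the counter, and the placeholder task body are all illustrative; the real task would make the ~3 HTTP calls per number and log to the database.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class PoolSketch {
    public static void main(String[] args) throws Exception {
        List<String> numbers = List.of("5550001", "5550002", "5550003");
        // Bound the pool to the CPU count so the box is not pegged at 100%.
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        AtomicInteger processed = new AtomicInteger();
        for (String number : numbers) {
            pool.submit(() -> {
                // Placeholder for the ~3 HTTP calls per number plus the DB log.
                processed.incrementAndGet();
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        System.out.println("processed=" + processed.get());
    }
}
```

A bounded pool also gives natural backpressure: you can cap in-flight HTTP requests instead of spawning a thread per message.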

I'd like to hear about better design approaches, especially if you've handled this kind of volume before. I'm ready to switch to Unix or write it in Lisp if that handles this kind of situation better.

Thanks.

asked Nov 06 '22 by keni

1 Answer

Here is a simple, super-pragmatic solution:

First, split your text file into smaller files, perhaps with something like 10,000 entries in each. Let's call them numbers_x.queue.
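The splitting step could look something like this in Java; the file names, chunk size, and `split` helper are illustrative, not part of the answer's prescription.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class Splitter {
    // Split the input into numbers_0.queue, numbers_1.queue, ... with
    // up to chunkSize lines each.
    static int split(Path input, int chunkSize) throws IOException {
        int fileIndex = 0;
        try (BufferedReader in = Files.newBufferedReader(input)) {
            String line = in.readLine();
            while (line != null) {
                Path out = input.resolveSibling("numbers_" + fileIndex + ".queue");
                try (BufferedWriter w = Files.newBufferedWriter(out)) {
                    for (int i = 0; i < chunkSize && line != null; i++) {
                        w.write(line);
                        w.newLine();
                        line = in.readLine();
                    }
                }
                fileIndex++;
            }
        }
        return fileIndex; // number of .queue files written
    }

    public static void main(String[] args) throws Exception {
        Path tmp = Files.createTempFile("numbers", ".txt");
        Files.write(tmp, List.of("111", "222", "333", "444", "555"));
        // 5 lines at 2 per file -> 3 queue files.
        System.out.println("files=" + split(tmp, 2));
    }
}
```

Streaming line by line keeps memory flat even for the 15-million-row export.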

Create a thread-pool-based app where each thread processes the files using the following steps:

  1. Look for a file called numbers_x.done; if it exists, find the last complete number in it.
  2. If you found a .done file, scan through numbers_x.queue to position yourself at the number after the last one in the .done file.
  3. Read a number from the .queue file
  4. Do your web api calls
  5. Do your logging
  6. Append the number to the .done file
  7. If the .queue file is not at the end yet, go to step 3
  8. Delete the queue file, then the done file
  9. Grab another unprocessed .queue file and continue from 1
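The steps above can be sketched as a single-file worker in Java. This is a simplified, assumed implementation: the web API calls and logging are stubbed out as comments, and the resume logic uses `indexOf` for clarity rather than a streaming scan.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.List;

public class QueueWorker {
    // Process one numbers_x.queue file, checkpointing each finished number
    // into numbers_x.done so a crashed run can resume (steps 1-8 above).
    static int process(Path queue) throws IOException {
        Path done = Paths.get(queue.toString().replace(".queue", ".done"));
        List<String> finished = Files.exists(done)
                ? Files.readAllLines(done) : List.of();
        String last = finished.isEmpty() ? null
                : finished.get(finished.size() - 1);

        List<String> all = Files.readAllLines(queue);
        // Step 2: position just past the last checkpointed number, if any.
        int start = last == null ? 0 : all.indexOf(last) + 1;

        int calls = 0;
        for (int i = start; i < all.size(); i++) {
            String number = all.get(i);
            // Steps 4-5: web API calls and logging would go here.
            calls++;
            // Step 6: append the checkpoint before moving on.
            Files.write(done, List.of(number),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        }
        // Step 8: delete the queue file first, then the done file.
        Files.delete(queue);
        Files.delete(done);
        return calls;
    }

    public static void main(String[] args) throws Exception {
        Path q = Files.createTempFile("numbers_0", ".queue");
        Files.write(q, List.of("111", "222", "333"));
        // Simulate a crashed run that already finished "111".
        Path d = Paths.get(q.toString().replace(".queue", ".done"));
        Files.write(d, List.of("111"));
        System.out.println("resumed=" + process(q));
    }
}
```

Appending to the .done file only after the web calls succeed is what makes a crash safe: at worst, one number is retried.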

While this is a pretty crude approach, it is super easy to implement and pretty fault-tolerant, and you can easily split the .queue files between a set of servers and have them work in parallel.

answered Nov 15 '22 by kasperjj