Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Anatomy of a Distributed System in PHP

I've a problem which is giving me some hard time trying to figure it out the ideal solution and, to better explain it, I'm going to expose my scenario here.

I've a server that will receive orders from several clients. Each client will submit a set of recurring tasks that should be executed at some specified intervals, eg.: client A submits task AA that should be executed every minute between 2009-12-31 and 2010-12-31; so if my math is right that's about 525 600 operations in a year, given more clients and tasks it would be infeasible to let the server process all these tasks so I came up with the idea of worker machines. The server will be developed on PHP.

Worker machines are just regular cheap Windows-based computers that I'll host on my home or at my workplace, each worker will have a dedicated Internet connection (with dynamic IPs) and a UPS to avoid power outages. Each worker will also query the server every 30 seconds or so via web service calls, fetch the next pending job and process it. Once the job is completed the worker will submit the output to the server and request a new job and so on ad infinitum. If there is a need to scale the system I should just set up a new worker and the whole thing should run seamlessly. The worker client will be developed in PHP or Python.

At any given time my clients should be able to log on to the server and check the status of the tasks they ordered.

Now here is where the tricky part kicks in:

  • I must be able to reconstruct the already processed tasks if for some reason the server goes down.
  • The workers are not client-specific, one worker should process jobs for any given number of clients.

I've some doubts regarding the general database design and which technologies to use.

Originally I thought of using several SQLite databases and joining them all on the server but I can't figure out how I would group by clients to generate the job reports.

I've never actually worked with any of the following technologies: memcached, CouchDB, Hadoop and all the like, but I would like to know if any of these is suitable for my problem, and if yes which do you recommend for a newbie is "distributed computing" (or is this parallel?) like me. Please keep in mind that the workers have dynamic IPs.

Like I said before I'm also having trouble with the general database design, partly because I still haven't chosen any particular R(D)DBMS but one issue that I've and I think it's agnostic to the DBMS I choose is related to the queuing system... Should I precalculate all the absolute timestamps to a specific job and have a large set of timestamps, execute and flag them as complete in ascending order or should I have a more clever system like "when timestamp modulus 60 == 0 -> execute". The problem with this "clever" system is that some jobs will not be executed in order they should be because some workers could be waiting doing nothing while others are overloaded. What do you suggest?

PS: I'm not sure if the title and tags of this question properly reflect my problem and what I'm trying to do; if not please edit accordingly.

Thanks for your input!

@timdev:

  1. The input will be a very small JSON encoded string, the output will also be a JSON enconded string but a bit larger (in the order of 1-5 KB).
  2. The output will be computed using several available resources from the Web so the main bottleneck will probably be the bandwidth. Database writes may also be one - depending on the R(D)DBMS.
like image 876
Alix Axel Avatar asked Oct 04 '09 17:10

Alix Axel


3 Answers

It looks like you're on the verge of recreating Gearman. Here's the introduction for Gearman:

Gearman provides a generic application framework to farm out work to other machines or processes that are better suited to do the work. It allows you to do work in parallel, to load balance processing, and to call functions between languages. It can be used in a variety of applications, from high-availability web sites to the transport of database replication events. In other words, it is the nervous system for how distributed processing communicates.

You can write both your client and the back-end worker code in PHP.


Re your question about a Gearman Server compiled for Windows: I don't think it's available in a neat package pre-built for Windows. Gearman is still a fairly young project and they may not have matured to the point of producing ready-to-run distributions for Windows.

Sun/MySQL employees Eric Day and Brian Aker gave a tutorial for Gearman at OSCON in July 2009, but their slides mention only Linux packages.

Here's a link to the Perl CPAN Testers project, that indicates that Gearman-Server can be built on Win32 using the Microsoft C compiler (cl.exe), and it passes tests: http://www.nntp.perl.org/group/perl.cpan.testers/2009/10/msg5521569.html But I'd guess you have to download source code and build it yourself.

like image 192
Bill Karwin Avatar answered Nov 03 '22 10:11

Bill Karwin


Gearman seems like the perfect candidate for this scenario, you might even want to virtualize you windows machines to multiple worker nodes per machine depending on how much computing power you need.

Also the persistent queue system in gearman prevents jobs getting lost when a worker or the gearman server crashes. After a service restart the queue just continues where it has left off before crash/reboot, you don't have to take care of all this in your application and that is a big advantage and saves alot of time/code

Working out a custom solution might work but the advantages of gearman especially the persistent queue seem to me that this might very well be the best solution for you at the moment. I don't know about a windows binary for gearman though but i think it should be possible.

like image 4
ChrisR Avatar answered Nov 03 '22 10:11

ChrisR


A simpler solution would be to have a single database with multiple php-nodes connected. If you use a proper RDBMS (MSql + InnoDB will do), you can have one table act as a queue. Each worker will then pull tasks from that to work on and write it back into the database upon completion, using transactions and locking to synchronise. This depends a bit on the size of input/output data. If it's large, this may not be the best scheme.

like image 3
troelskn Avatar answered Nov 03 '22 09:11

troelskn