
Please recommend an alternative to Microsoft HPC [closed]

We aim to implement a distributed system on a cluster that will perform resource-intensive, image-based computation with heavy storage I/O. It has the following characteristics:

  1. There is a dedicated manager computer node and up to 100 compute nodes. The cluster must be easily expandable.
  2. It is built around job-task concept. A job may have one to 100,000 tasks.
  3. A job, which is initiated by the user on the manager node, results in the creation of tasks on the compute nodes.
  4. Tasks create other tasks on the fly.
  5. Some tasks may run for minutes, while others may take many hours.
  6. The tasks run according to a dependency hierarchy, which may be updated on the fly.
  7. The job may be paused and resumed later.
  8. Each task requires specific resources in terms of CPU (cores), memory and local hard disk space. The manager should be aware of this when scheduling tasks.
  9. Tasks report their progress and results back to the manager.
  10. The manager knows whether a task is alive or has hung.

We found Windows HPC Server 2008 R2 (HPCS) very close in concept to what we need. However, there are a few critical downsides:

  1. Task creation gets exponentially slower as the number of tasks grows; submitting more than a few thousand tasks takes an unbearable amount of time.
  2. A task cannot report its progress back to the manager; only a job can.
  3. There is no communication with a task while it is running, which makes it impossible to tell whether it is still running or needs to be restarted.
  4. HPCS only understands nodes, CPU cores, and memory as resource units; we cannot introduce resource units of our own (such as free disk space or custom hardware devices).

Here's my question: does anybody know of, or have experience with, a distributed computing framework that could help us? We are using Windows.

asked Jun 30 '10 by Pavel Radzivilovsky


2 Answers

I would take a look at the Condor high-throughput computing project. It supports Windows (as well as Linux and OS X) clients and servers, handles complex dependencies between tasks using DAGMan, and can suspend (and even move) tasks. I have experience with Condor-based systems that scale to thousands of machines across university campuses.
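To give a flavour of how this could look, here is a minimal sketch using the HTCondor Python bindings (the htcondor module). The worker executable process_tile.exe and the resource figures are made up for illustration; a real dependency hierarchy would go into a separate DAGMan file submitted with condor_submit_dag rather than being handled in this script.

    import htcondor

    # Connect to the local scheduler daemon (schedd) of the pool.
    schedd = htcondor.Schedd()

    # Describe one class of tasks. request_cpus / request_memory / request_disk
    # are what the matchmaker uses when placing tasks on machines, which covers
    # the per-task CPU / memory / local-disk requirements listed in the question.
    submit = htcondor.Submit({
        "executable":     "process_tile.exe",      # hypothetical worker binary
        "arguments":      "--tile $(ProcId)",      # each task gets its own tile index
        "request_cpus":   "4",
        "request_memory": "2GB",
        "request_disk":   "10GB",
        "output":         "tile_$(ProcId).out",
        "error":          "tile_$(ProcId).err",
        "log":            "tiles.log",             # task events are recorded here
    })

    # Queue 10,000 tasks in a single submission.
    result = schedd.submit(submit, count=10000)
    print("submitted cluster", result.cluster())

Dependencies between groups of tasks are then expressed in a plain-text DAGMan file (JOB and PARENT ... CHILD lines), and DAGMan takes care of retries and of resuming a partially completed DAG.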

answered by Andrew Walker


Platform LSF will do everything you need. It runs on Windows, is commercial, and can be purchased with support.

Taking your requirements point by point:

  1. A dedicated manager node and up to 100 compute nodes, easily expandable: yes.
  2. Built around the job-task concept, with one to 100,000 tasks per job: yes.
  3. A job initiated by the user on the manager node creating tasks on the compute nodes: yes.
  4. Tasks creating other tasks on the fly: yes.
  5. Tasks that run for minutes alongside tasks that take many hours: yes.
  6. A dependency hierarchy that can be updated on the fly: yes.
  7. Pausing a job and resuming it later: yes.
  8. Per-task resource requirements (CPU cores, memory, local disk space) that the scheduler takes into account: yes.
  9. Tasks reporting their progress and results back to the manager: yes.
  10. Detecting whether a task is alive or has hung: yes.
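To give a rough sense of how the resource and dependency requirements map onto LSF, here is a minimal sketch that drives bsub from Python. It assumes an LSF cluster is reachable and bsub is on the PATH; the executable name, job names, and resource figures are hypothetical.

    import subprocess

    def bsub(args):
        """Submit one task with bsub and return its stdout (contains the job id)."""
        completed = subprocess.run(["bsub"] + args, check=True,
                                   capture_output=True, text=True)
        return completed.stdout.strip()

    # Task 1: 4 slots, ~4 GB of memory and ~10 GB of temporary disk space.
    # mem and tmp are built-in LSF load indices; custom resources can also be
    # defined in the cluster configuration.
    print(bsub(["-J", "preprocess",
                "-n", "4",
                "-R", "rusage[mem=4096:tmp=10240]",
                "-o", "preprocess.%J.out",
                "process_tile.exe", "--stage", "preprocess"]))

    # Task 2: only starts once task 1 has finished successfully (-w dependency).
    print(bsub(["-J", "render",
                "-w", "done(preprocess)",
                "-n", "8",
                "-R", "rusage[mem=8192]",
                "-o", "render.%J.out",
                "process_tile.exe", "--stage", "render"]))

Jobs submitted this way can be suspended and resumed later with bstop and bresume, and bjobs and bpeek give liveness and progress information while they run.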

answered by Stan Graves