I have 100 servers in my cluster.

At time 17:35:00, all 100 servers are provided with data (of size 1[MB]). Each server processes the data and produces an output of about 40[MB]. The processing time for each server is 5[sec].

At time 17:35:05 (5[sec] later), a central machine needs to read all the output from all 100 servers (remember, the total size of data is: 100 [machines] x 40 [MB] ~ 4[GB]), aggregate it, and produce an output.

It is of high importance that the entire process of gathering the 4[GB] of data from all 100 servers takes as little time as possible. How do I go about solving this problem? Are there any existing tools (ideally in Python, but I would consider other solutions) that can help?
Look at the flow of data in your application, and then look at the data rates that your (I assume shared) disk system provides and the rate your GigE interconnect provides, and the topology of your cluster. Which of these is a bottleneck?
GigE provides a theoretical maximum transmission rate of 125 MB/s between nodes - thus moving 100 chunks of 40 MB each (~4 GB) from the 100 processing nodes into your central node will take roughly 32 s over GigE.
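As a quick sanity check, here is the arithmetic behind that estimate (nothing cluster-specific assumed):

    # Back-of-the-envelope transfer time over GigE (theoretical peak, ignoring protocol overhead)
    total_mb = 100 * 40            # 100 nodes x 40 MB each = 4000 MB (~4 GB)
    gige_mb_per_s = 125            # 1 Gbit/s is about 125 MB/s
    print(total_mb / gige_mb_per_s)  # -> 32.0 seconds to funnel everything into one node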
A file system shared between all your nodes provides an alternative to RAM-to-RAM data transfer over Ethernet.
If your shared file system is fast at the disk read/write level (say, a bunch of many-disk RAID 0 or RAID 10 arrays aggregated into a Lustre file system or some such) and it uses a 20 Gb/s or 40 Gb/s interconnect between block storage and nodes, then 100 nodes each writing a 40 MB file to disk and the central node reading those 100 files may be faster than transferring the 100 40 MB chunks over the GigE node-to-node interconnect.
But if your shared file system is a RAID 5 or 6 array exported to the nodes via NFS over GigE, that will be slower than a RAM-to-RAM transfer via GigE using RPC or MPI, because you have to write and read the disks over GigE anyway.
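To make the RAM-to-RAM option concrete, here is a minimal mpi4py sketch of the gather step. It assumes MPI is installed across the cluster and that each worker already holds its ~40 MB result in memory; the placeholder payload is purely illustrative.

    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    # Stand-in for the ~40 MB output each worker produced locally.
    payload = None if rank == 0 else b"x" * (40 * 1024 ** 2)

    # gather() ships every worker's payload to rank 0 over the node interconnect.
    chunks = comm.gather(payload, root=0)

    if rank == 0:
        data = b"".join(c for c in chunks if c is not None)
        print("collected %.0f MB from %d workers" % (len(data) / 1024 ** 2, comm.Get_size() - 1))

Launched with mpirun across the central node plus the 100 workers, this is still bounded by the ~125 MB/s GigE link into the central node, but it avoids the disk round trip entirely.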
So, there have been some good answers and discussion of your question. But we did not know your node interconnect speed, we do not know how your disk is set up (shared disk, or one disk per node), whether a shared disk has its own interconnect, or what speed that is.
Node interconnect speed is now known. It is no longer a free variable.
Disk setup (shared/not-shared) is unknown, thus a free variable.
Disk interconnect (assuming shared disk) is unknown, thus another free variable.
How much RAM your central node has is unknown (can it hold 4 GB of data in RAM?), thus it is another free variable.
If everything, including the shared disk, uses the same GigE interconnect, then it is safe to say that 100 nodes each writing a 40 MB file to disk and the central node then reading 100 40 MB files from disk is the slowest way to go, unless your central node cannot allocate 4 GB of RAM without swapping, in which case things probably get complicated.
If your shared disk is high-performance, it may be faster for the 100 nodes to each write a 40 MB file and for the central node to then read those 100 40 MB files.
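If the shared-disk route wins, the central node's half of the job is just reading and concatenating the 100 files. A minimal sketch, assuming the workers drop their outputs into a shared directory such as /shared/outputs (the path and naming scheme are illustrative, and each worker is assumed to write to a temporary name and rename when finished, so only complete files are picked up):

    import glob

    # Read every worker's output from the shared filesystem and concatenate.
    parts = []
    for path in sorted(glob.glob("/shared/outputs/node_*.bin")):
        with open(path, "rb") as f:
            parts.append(f.read())

    data = b"".join(parts)
    print("aggregated %.0f MB from %d files" % (len(data) / 1024 ** 2, len(parts)))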
Use rpyc. It's mature and actively maintained.
Here's their blurb about what it does:
RPyC (IPA:/ɑɹ paɪ siː/, pronounced like are-pie-see), or Remote Python Call, is a transparent and symmetrical python library for remote procedure calls, clustering and distributed-computing. RPyC makes use of object-proxying, a technique that employs python's dynamic nature, to overcome the physical boundaries between processes and computers, so that remote objects can be manipulated as if they were local.
David Mertz has a quick introduction to RPyC at IBM developerWorks.
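A minimal sketch of how the gather could look with RPyC. The service name, port, and get_output() method are illustrative (they are not part of RPyC itself); each worker runs the service, and the central node connects to all 100 and pulls the results.

    import rpyc
    from rpyc.utils.server import ThreadedServer

    # --- on each worker node: expose the result via an RPyC service ---
    class OutputService(rpyc.Service):
        def exposed_get_output(self):
            # Stand-in for the real ~40 MB result produced at 17:35:05.
            return b"x" * (40 * 1024 ** 2)

    # ThreadedServer(OutputService, port=18861).start()   # run this on every worker

    # --- on the central node: connect to each worker and pull its output ---
    def collect(hosts, port=18861):
        parts = []
        for host in hosts:
            conn = rpyc.connect(host, port)
            parts.append(conn.root.get_output())
            conn.close()
        return b"".join(parts)

    # data = collect(["node%03d" % i for i in range(100)])

Pulling sequentially is still limited by the central node's GigE link (~32 s for 4 GB), so parallelising the connections mainly hides per-connection latency rather than raising total throughput.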