I have a few multi-core computers connected by an InfiniBand network. I would like to do some low-latency computation on a pool of shared memory, with remote atomic operations. I know RDMA is the way to go. On each node I would register a memory region (and protection domain) for data sharing.
The online RDMA examples often focus on a single connection between a single-threaded server and a single-threaded client. I, however, would like to run a multi-threaded process on each of the InfiniBand nodes, and I am very puzzled about the following...
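For concreteness, here is roughly what I plan to do on each node. This is just a minimal libibverbs sketch: device selection and error handling are omitted, and the pool size is a placeholder.

```c
#include <infiniband/verbs.h>
#include <stdlib.h>

#define POOL_SIZE (64 * 1024 * 1024)  /* placeholder size for the shared pool */

/* Allocate a protection domain and register a region that remote peers
 * may read, write, and run atomics (e.g. compare-and-swap) against. */
struct ibv_mr *register_pool(struct ibv_context *ctx, struct ibv_pd **pd_out)
{
    struct ibv_pd *pd = ibv_alloc_pd(ctx);
    void *pool;
    posix_memalign(&pool, 4096, POOL_SIZE);

    struct ibv_mr *mr = ibv_reg_mr(pd, pool, POOL_SIZE,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE |
                                   IBV_ACCESS_REMOTE_ATOMIC);
    *pd_out = pd;
    return mr;  /* peers need mr->rkey and the pool's base address */
}
```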
How many queue pairs should I prepare on each node, for a cluster of n nodes and m threads in total? To be more specific, can multiple threads on the same node share the same queue pair?
How many completion queues should I prepare on each node? I will have multiple threads issuing remote read/write/CAS operations on each node. If they were to share a common completion queue, the completion events would be mixed up. If each thread had its own separate completion queue, there would be a great many of them.
Would you suggest using an existing library instead of writing this software myself? (Hmm, or should I write one and open-source it? :-)
Thank you for your kind suggestion(s).
Remote Direct Memory Access (RDMA) lets two networked computers exchange data in main memory without involving the processor, cache, or operating system of either machine. It supports zero-copy networking: the network adapter transfers data directly between the wire and application memory, eliminating the copies between application buffers and operating-system buffers.
The advantage of RDMA is low-latency, memory-to-memory transfer of information between compute nodes without burdening the CPU; the transfer function is offloaded to the network adapter hardware, bypassing the operating system's software network stack.
RDMA offers applications both reliable and unreliable, connected and unconnected transport modes. Specifically, there are three modes: Reliable Connected (RC), Unreliable Connected (UC), and Unreliable Datagram (UD).
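For illustration, here is a minimal libibverbs sketch of how the transport mode is selected at QP creation time. The queue depths are arbitrary, and note that remote atomics such as compare-and-swap require RC.

```c
#include <infiniband/verbs.h>

/* The transport mode is fixed when the QP is created, via qp_type.
 * "cq" is a previously created completion queue. */
struct ibv_qp *make_rc_qp(struct ibv_pd *pd, struct ibv_cq *cq)
{
    struct ibv_qp_init_attr attr = {
        .send_cq = cq,
        .recv_cq = cq,
        .qp_type = IBV_QPT_RC,   /* IBV_QPT_UC / IBV_QPT_UD for the other modes */
        .cap = {
            .max_send_wr  = 128,
            .max_recv_wr  = 128,
            .max_send_sge = 1,
            .max_recv_sge = 1,
        },
    };
    return ibv_create_qp(pd, &attr);
}
```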
On Linux at least, the InfiniBand verbs library is completely thread-safe. So you can use as many or as few queue pairs (QPs) in your multi-threaded app as you want -- multiple threads can post work requests to a single QP safely, although of course you will have to make sure whatever tracking of outstanding requests, etc. that you do in your own application is thread-safe.
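For example, here is a rough sketch of several threads safely sharing one QP to post a remote compare-and-swap. The remote address and rkey are assumed to have been exchanged out of band; the atomic counter is purely the application's own bookkeeping, since ibv_post_send itself is already thread-safe.

```c
#include <infiniband/verbs.h>
#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint64_t next_wr_id;  /* app-level tracking of outstanding requests */

/* result_buf must lie inside the locally registered region "mr";
 * the old remote value is written there when the CAS completes. */
int post_remote_cas(struct ibv_qp *qp, struct ibv_mr *mr, uint64_t *result_buf,
                    uint64_t remote_addr, uint32_t rkey,
                    uint64_t expected, uint64_t desired)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t) result_buf,
        .length = sizeof(uint64_t),
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr = {
        .wr_id      = atomic_fetch_add(&next_wr_id, 1),
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_ATOMIC_CMP_AND_SWP,
        .send_flags = IBV_SEND_SIGNALED,
        .wr.atomic  = {
            .remote_addr = remote_addr,
            .compare_add = expected,   /* value the remote word must equal */
            .swap        = desired,    /* value to install if it does */
            .rkey        = rkey,
        },
    };
    struct ibv_send_wr *bad;
    return ibv_post_send(qp, &wr, &bad);  /* safe to call from multiple threads */
}
```

Any number of threads can call post_remote_cas() on the same QP concurrently; the only per-application concern is keeping track of which wr_id belongs to which caller when completions are reaped.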
It is true that each send queue and each receive queue (remember that a QP is really a pair of queues :) is attached to a single completion queue (CQ). So if you want each thread to have its own CQ, then each thread will need its own QP to submit work into.
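Here's a minimal sketch of that per-thread layout, assuming the device context and PD already exist (queue depths are arbitrary):

```c
#include <infiniband/verbs.h>

struct worker_res {
    struct ibv_cq *cq;
    struct ibv_qp *qp;
};

/* Give each worker thread its own CQ and QP, so its completions never
 * mix with another thread's. The CQ binding is fixed at QP creation. */
int make_worker(struct ibv_context *ctx, struct ibv_pd *pd,
                struct worker_res *res)
{
    res->cq = ibv_create_cq(ctx, 256 /* depth */, NULL, NULL, 0);
    if (!res->cq)
        return -1;

    struct ibv_qp_init_attr attr = {
        .send_cq = res->cq,   /* this thread's completions only */
        .recv_cq = res->cq,
        .qp_type = IBV_QPT_RC,
        .cap = { .max_send_wr = 128, .max_recv_wr = 128,
                 .max_send_sge = 1,  .max_recv_sge = 1 },
    };
    res->qp = ibv_create_qp(pd, &attr);
    return res->qp ? 0 : -1;
}
```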
In general QPs and CQs are not really a limited resource -- you can easily have hundreds or thousands on a single node without trouble. So you can design your app without worrying too much about the absolute number of queues you're using. This is not to say you don't have to worry about scalability -- for example if you have a lot of receive queues and a lot of buffers per queue, then you may tie up too much memory in receive buffering, so you end up needing to use shared receive queues (SRQs).
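A sketch of the SRQ variant, with made-up queue depths: the SRQ is created once and handed to each QP at creation time, so all connections draw receive buffers from one shared pool instead of each holding a private set.

```c
#include <infiniband/verbs.h>

/* One shared receive queue for the whole process. */
struct ibv_srq *make_srq(struct ibv_pd *pd)
{
    struct ibv_srq_init_attr attr = {
        .attr = { .max_wr = 4096, .max_sge = 1 },
    };
    return ibv_create_srq(pd, &attr);
}

/* QPs created with .srq set take their receives from the shared pool;
 * max_recv_wr in .cap is then irrelevant. */
struct ibv_qp *make_qp_on_srq(struct ibv_pd *pd, struct ibv_cq *cq,
                              struct ibv_srq *srq)
{
    struct ibv_qp_init_attr attr = {
        .send_cq = cq,
        .recv_cq = cq,
        .srq     = srq,
        .qp_type = IBV_QPT_RC,
        .cap     = { .max_send_wr = 128, .max_send_sge = 1 },
    };
    return ibv_create_qp(pd, &attr);
}
```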
There are a number of middleware libraries that use IB; MPI (e.g. Open MPI, http://open-mpi.org/) is probably the best-known one, and it's worth evaluating before you get too far into reinventing things. The MPI developers have also published a lot of research about using IB/RDMA efficiently, which is worth seeking out in case you do decide to build your own system.
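As a taste of how much verbs plumbing MPI hides, here is a minimal sketch using MPI's one-sided (RMA) interface to do a remote compare-and-swap; over InfiniBand, implementations generally map this onto RDMA atomics. The program layout is of course just an illustration.

```c
#include <mpi.h>
#include <stdio.h>
#include <stdint.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Each rank exposes one 64-bit word via an RMA window. */
    uint64_t *base;
    MPI_Win win;
    MPI_Win_allocate(sizeof(uint64_t), sizeof(uint64_t), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &base, &win);
    *base = 0;
    MPI_Barrier(MPI_COMM_WORLD);  /* everyone's window is initialized */

    /* All ranks race to CAS the word owned by rank 0. */
    MPI_Win_lock_all(0, win);
    uint64_t expected = 0, desired = (uint64_t) rank + 1, old;
    MPI_Compare_and_swap(&desired, &expected, &old, MPI_UINT64_T,
                         0 /* target rank */, 0 /* displacement */, win);
    MPI_Win_flush(0, win);
    MPI_Win_unlock_all(win);

    printf("rank %d saw old value %llu\n", rank, (unsigned long long) old);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```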