
ZeroMQ doesn't auto-reconnect

I've just downloaded and installed zeromq-4.0.5 on an Ubuntu Precise (12.04) system. I've compiled the hello-world client (REQ, connect, 127.0.0.1) and server (REP, bind) written in C.

  1. I start the server.
  2. I start the client.
  3. Each second the client sends a message to the server, and receives a response.
  4. I press Ctrl-C to stop the server.
  5. The client tries to send its next outgoing message and gets stuck in a never-returning epoll system call (as shown by strace).
  6. I restart the server.
  7. The zmq_recv call in the client is still stuck, even when the new server has been running for a minute. The only way to make progress for the client is to kill it (with Ctrl-C) and restart it.

Q1: Is this the expected behavior? I'd expect that in a few seconds the client should figure out that the server is running again, and it would auto-reconnect.

Q2: What should I change in the example code to fix this?

Q3: Am I using the wrong version of the software, or is something broken on my system?

I've disabled the firewall, sudo iptables -S prints -P INPUT ACCEPT; -P FORWARD ACCEPT; -P OUTPUT ACCEPT.

In the strace -f ./hwclient output I can see that the client tries connect() 10 times a second (matching the default ZMQ_RECONNECT_IVL of 100 ms) after the server went down. In the strace -f ./hwserver output I can see that the restarted server accept()s the connection. However, communication gets stuck after that, and the server never receives the actual request from the client (but it does notice when I kill the client; the server also receives requests from other clients that were started after the server restart).

Using ipc:// instead of tcp:// causes the same behavior.

The auto-reconnect happens successfully in zmq_send if the server was killed before the client makes its next zmq_send call. However, when the server gets killed while the client is inside zmq_recv, the zmq_recv blocks indefinitely, and the client can't seem to recover from that.

I've found this article, which recommends using timeouts. However, I think that timeouts can't be the right solution, because the TCP disconnect notification is already available in the client process, and it's already acting on it -- it just doesn't make zmq_recv resend the request to the new server -- or at least return early indicating an error.

asked Oct 24 '14 22:10 by pts




2 Answers

You may be having the same issue that ZeroMQ just fixed for me in 4.0.6 (issue 1362). Basically, the subscriber socket wouldn't always resend its filter during a reconnection (an empty filter means no messages from the publisher reach that subscriber). The only way to recover was to restart the client application. Their fix seems to have done the job. The issue was especially visible when using a transport (like stunnel) to tunnel the connections. Before 4.0.6, I was able to work around the issue by setting the "immediate" flag on the subscriber socket.

answered Sep 21 '22 11:09 by user4599197


A3: No.

A2: Do not expect a demo to be designed for fault-resilient operation.

A1: Yes.


Where to go for more details?

IMHO the best next step is to get a more global view. That may sound complicated for the first few things one tries to code with ZeroMQ, but even if you don't read it step by step, at least jump to page 265 of Code Connected, Volume 1 [asPdf->].

The fastest learning curve would be to first take a look at Fig. 60 Republishing Updates and Fig. 62 HA Clone Server pair for a possible high-availability approach, and then go back to the roots, elements and details.

answered Sep 19 '22 11:09 by user3666197