
ZeroMQ doesn't auto-reconnect

Tags: c, tcp, zeromq, reconnect

I've just downloaded and installed zeromq-4.0.5 on an Ubuntu Precise (12.04) system. I've compiled the hello-world client (REQ, connect, 127.0.0.1) and server (REP, bind) written in C.

1. I start the server.
2. I start the client.
3. Each second the client sends a message to the server, and receives a response.
4. I press Ctrl-C to stop the server.
5. The client tries to send its next outgoing message and it gets stuck in a never-returning epoll system call (as shown by strace).
6. I restart the server.
7. The zmq_recv call in the client is still stuck, even when the new server has been running for a minute. The only way to make progress for the client is to kill it (with Ctrl-C) and restart it.

Q1: Is this the expected behavior? I'd expect that in a few seconds the client should figure out that the server is running again, and it would auto-reconnect.

Q2: What should I change in the example code to fix this?

Q3: Am I using the wrong version of the software, or is something broken on my system?

I've disabled the firewall, sudo iptables -S prints -P INPUT ACCEPT; -P FORWARD ACCEPT; -P OUTPUT ACCEPT.

In the strace -f ./hwclient output I can see that the client is trying connect() 10 times a second (the default value of ZMQ_RECONNECT_IVL) after the server went down. On the strace -f ./hwserver output I can see that the restarted server accept()s the connection. However, communication gets stuck after that, and the server never receives the actual request from the client (but it notices when I kill the client; also the server receives requests from other clients which have been started after the server restart).

Using ipc:// instead of tcp:// causes the same behavior.

The auto-reconnect happens successfully in zmq_send if the server has been killed before the client does the next zmq_send. However, when the server gets killed while the client is running zmq_recv, then the zmq_recv blocks indefinitely, and the client can't seem to recover from that.

I've found this article, which recommends using timeouts. However, I think that timeouts can't be the right solution, because the TCP disconnect notification is already available in the client process, and it's already acting on it -- it just doesn't make zmq_recv resend the request to the new server -- or at least return early indicating an error.

asked Oct 24 '14 22:10

pts

2 Answers

You may be hitting the same issue that ZeroMQ fixed for me in 4.0.6 (issue 1362). Basically, the subscriber socket would not always resend its filter during a reconnection (with no filter registered, the publisher sends that subscriber no messages). The only way to recover was to restart the client application. Their fix seems to have done the job. The issue was most visible when tunnelling the connections through something like stunnel. Before 4.0.6, I was able to work around it by setting the "immediate" flag on the subscriber socket.

answered Sep 21 '22 11:09

user4599197

A3: No.

A2: Do not expect a demo to be designed for fault-resilient operation.

A1: Yes.

Where to go for more details?

The best next step, IMHO, is to get a more global view. That may sound complicated for one's first few ZeroMQ programs, but if you have not been reading it step by step, at least jump to page 265 of Code Connected, Volume 1 (available as a PDF).

The fastest learning curve would be to first take a look at Fig. 60 (Republishing Updates) and Fig. 62 (HA Clone Server pair) for a possible high-availability approach, and then go back to the roots, elements and details.

answered Sep 19 '22 11:09

user3666197
