I have a simple pub-sub setup on a mid-sized network, using ZMQ 2.1. Some subscribers use the C# bindings and others use the Python bindings; the issue I'm having is the same with both.
If I pull the network cable from a machine running a subscriber, I get an uncatchable error that immediately terminates that subscriber.
Here's a very simple example of a subscriber in Python (not actual production code, but enough to reproduce the problem):
import zmq

def main(server_address, port):
    context = zmq.Context()
    sub_socket = context.socket(zmq.SUB)
    sub_socket.connect("tcp://" + server_address + ":" + str(port))
    sub_socket.setsockopt(zmq.SUBSCRIBE, "KITH1S2")
    while True:
        msg = sub_socket.recv()
        print msg

if __name__ == "__main__":
    main("company-intranet", 4000)
In C# the program simply terminates silently. In Python I at least get this:
Assertion failed: rc == 0 (....\src\zmq_connector.cpp:48)
This application has requested the Runtime to terminate it in an unusual way. Please contact the application's support team for more information.
I've tried non-blocking versions, and poller versions, but in either case this instant termination problem persists. Is there something obvious I should be doing but I'm not? (That is, obvious to someone else :) ).
EDIT:
Found the following: https://zeromq.jira.com/browse/LIBZMQ-207
Seems as though it is/was a known issue.
That link in turn links to GitHub, where the changelog for 2.1.10 has this note:
- Fixed issue 207, assertion failure in zmq_connecter.cpp:48, when an invalid zmq_connect() string was used, or the hostname could not be resolved. The zmq_connect() call now returns -1 in both those cases.
Although connect() does indeed throw an Invalid Argument exception in Python (not C# apparently?), recv() still fails. If the subscriber machine suddenly loses the network, that subscriber will simply stop functioning.
So - I'm going to try using IP addresses instead of named addresses to see if this will bypass the issue. Not ideal, but better than insta-crash.
If there is any interruption in the Docker host's network, the ZMQ client in the Docker container stops receiving messages and does not try to reconnect. A network capture shows that no more packets are sent on the TCP connection in either direction. I will try the heartbeat solution mentioned above.
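For anyone trying the heartbeat route, here is a rough sketch of the subscriber side. It assumes the publisher sends something (a real message or a dedicated heartbeat) more often than `silence_limit_s`, so that prolonged silence can be treated as a dead connection. The names `receive_with_watchdog` and `make_subscriber`, the address `192.0.2.10`, and all the timing values are made up for illustration. `sock.poll()` is pyzmq's `Socket.poll`, with the timeout in milliseconds. It would not help with the assertion failure in the question, which happens inside libzmq; it only covers the case where the connection goes quiet.

```python
import time


def receive_with_watchdog(make_socket, handle, silence_limit_s=30.0,
                          poll_ms=1000, max_reconnects=None,
                          clock=time.time):
    """Receive messages, recreating the socket after prolonged silence.

    make_socket    -- hypothetical factory returning a connected SUB socket
    handle         -- callback invoked with each received message
    max_reconnects -- stop after this many reconnects (None = run forever)
    """
    sock = make_socket()
    last_seen = clock()
    reconnects = 0
    while max_reconnects is None or reconnects < max_reconnects:
        if sock.poll(poll_ms):
            # Something arrived within poll_ms: the link is alive.
            handle(sock.recv())
            last_seen = clock()
        elif clock() - last_seen > silence_limit_s:
            # Too quiet for too long: drop the socket and build a new one.
            sock.close()
            sock = make_socket()
            last_seen = clock()
            reconnects += 1
    sock.close()
    return reconnects


def make_subscriber(endpoint="tcp://192.0.2.10:4000", topic=b"KITH1S2"):
    """Example factory; endpoint and topic are placeholders."""
    import zmq  # imported here so the loop above does not depend on pyzmq
    sock = zmq.Context.instance().socket(zmq.SUB)
    sock.setsockopt(zmq.LINGER, 0)  # do not block on close()
    sock.connect(endpoint)
    sock.setsockopt(zmq.SUBSCRIBE, topic)
    return sock
```

Calling `receive_with_watchdog(make_subscriber, handle)` with your own `handle` callback runs until the process is stopped. If the publisher uses a separate heartbeat topic, the subscriber has to subscribe to that topic as well, otherwise the heartbeats are filtered out and every quiet period looks like a failure.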
Just like ZMQ.REQ which can connect to multiple ZMQ.REP, ZMQ.SUB can connect to multiple ZMQ.PUB (publishers). No single publisher overwhelms the subscriber. The messages from both publishers are interleaved.
Subscribers usually set a filter for the topics they are interested in. Subscribers are created with the ZMQ.SUB socket type. Note that a ZMQ subscriber can connect to many publishers.
Original question: Is there something obvious I should be doing but I'm not?
No.
The workaround for now is to connect using IP addresses instead of hostnames. With ZMQ 2.1.x, this does not cause the program to fail when the network is disconnected.
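If hard-coding IP addresses is unattractive, one possible variation is to resolve the hostname once in Python at startup and hand ZMQ the literal address. This is only a sketch: `resolve_endpoint` is a hypothetical helper, not part of pyzmq, and it relies on the standard library's `socket.gethostbyname`, which is IPv4-only. A failed lookup then surfaces as a catchable `socket.gaierror` in Python rather than inside libzmq. I have not verified this against every 2.1.x release, so treat it as an assumption to test.

```python
import socket


def resolve_endpoint(host, port):
    """Return a tcp:// endpoint that uses a literal IPv4 address.

    Raises socket.gaierror if the hostname cannot be resolved.
    A dotted-quad string such as "127.0.0.1" is returned unchanged.
    """
    ip_address = socket.gethostbyname(host)
    return "tcp://%s:%d" % (ip_address, port)


# Intended use with the subscriber from the question:
#     sub_socket.connect(resolve_endpoint("company-intranet", 4000))
```

The trade-off is that the address is fixed for the lifetime of the socket, so a server that changes its IP address requires the subscriber to resolve and connect again.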