Python UDP socket semi-randomly failing to receive

I have a problem with an application, and I suspect the cause is in my code.

The application is used to 'ping' some custom made network devices to check if they're alive. It pings them every 20 seconds with a special UDP packet and expects a response. If they fail to answer 3 consecutive pings the application sends a warning message to the staff.

The application runs 24/7, and a few times a day (mostly 2-5) it fails to receive UDP packets for exactly 10 minutes, after which everything goes back to normal. During those 10 minutes only 1 device seems to be replying; the others seem dead. That much I've been able to deduce from the logs.

I've used Wireshark to sniff the packets and I've verified that ping packets are going both out AND in, so the network part seems to be working okay, all the way to the OS. The computers are running WinXPPro and some have no configured firewall whatsoever. I'm having this issue on different computers, different Windows installs and different networks.

I'm really at a loss as to what might be the problem here.

I'm attaching the relevant part of the code, which does all the networking. It runs in a separate thread from the rest of the application.

I thank you in advance for whatever insight you might provide.

def monitor(self):
    checkTimer = time()
    while self.running:
        read, write, error = select.select([self.commSocket],[self.commSocket],[],0)
        if self.commSocket in read:
            data, addr = self.commSocket.recvfrom(1024)
            self.processInput(data, addr)

        if time() - checkTimer > 20: # every 20 seconds
            checkTimer = time()
            if self.commSocket in write:
                for rtc in self.rtcList:
                    addr = (rtc, 7) # port 7 is the echo port
                    if not self.rtcCheckins[rtc][0]: # if last check was a failure
                        self.rtcCheckins[rtc][1] += 1 # incr failure count
                    self.rtcCheckins[rtc][0] = False # setting last check to failure

        for rtc in self.rtcList:
            if self.rtcCheckins[rtc][1] > 2: # didn't answer for a whole minute
                self.rtcCheckins[rtc][1] = 0
flowInTheDark asked Jul 18 '12 07:07


1 Answer

You don't mention it, so I have to remind you: since you are using select(), that socket had better be non-blocking. Otherwise your recvfrom() can block. That should not really happen when select() is handled properly, but it is hard to tell from the short code snippet.
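As a minimal sketch of that setup (Python 3; the helper names `make_nonblocking_udp_socket` and `wait_readable` are made up for illustration and are not from the question's code), the socket is switched to non-blocking mode once, and select() is given a real timeout and only a read set, so the thread sleeps in the kernel instead of spinning:

```python
import select
import socket


def make_nonblocking_udp_socket(bind_addr=("0.0.0.0", 0)):
    """Create a UDP socket whose recvfrom() never blocks (hypothetical helper)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(bind_addr)
    sock.setblocking(False)  # recvfrom() now raises BlockingIOError when empty
    return sock


def wait_readable(sock, timeout=0.5):
    """Sleep in select() for up to `timeout` seconds; True if data is waiting."""
    # Only the read set is passed; the write and error sets are left empty.
    readable, _, _ = select.select([sock], [], [], timeout)
    return sock in readable
```

The 0.5 second timeout is an arbitrary choice for the sketch; anything well below the 20 second ping period would do.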

Also, you don't have to check a UDP socket for writability; it is always writable.

Now for the real problem - you are saying that packets are entering the system, but your code does not receive them. This is most probably due to the overflow of the socket receive buffer. Did the number of ping targets increase over those last 15 years? You are setting yourself up for a ping-response storm, and probably not reading those responses fast enough, so they pile up in the receive buffer and eventually get dropped.
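A rough way to see this kind of drop on a single machine is sketched below (Python 3). It assumes a loopback test is a fair stand-in for the real network, which it only partly is: the function name `count_survivors`, the packet count, the payload size and the deliberately tiny buffer are all invented for the demonstration, and the exact numbers depend on the OS.

```python
import socket


def count_survivors(n_packets=2000, payload=b"x" * 512, rcvbuf=4096):
    """Send a burst to a socket nobody is reading, then count what is left."""
    rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    # Ask for a small receive buffer; the OS may round this value up.
    rx.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, rcvbuf)
    rx.bind(("127.0.0.1", 0))
    rx.setblocking(False)

    tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sent = 0
    for _ in range(n_packets):
        try:
            tx.sendto(payload, rx.getsockname())
            sent += 1
        except OSError:
            break  # some stacks report a full queue to the sender instead

    # Only now start reading, the way a slow receiver would.
    received = 0
    while True:
        try:
            rx.recvfrom(2048)
            received += 1
        except BlockingIOError:
            break  # queue is empty; whatever did not fit was dropped silently

    tx.close()
    rx.close()
    return sent, received
```

Calling `count_survivors()` returns a `(sent, received)` pair; when `received` is smaller than `sent`, the difference is packets that reached the OS but never reached the application, which matches the symptom described in the question.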

My suggestions in order of ROI:

  • Spread out ping requests, don't set yourself up for a DDoS. Query, say, one system per iteration and keep the last check time per target. This will allow you to even out the number of packets going out and coming in.
  • Increase SO_RCVBUF to a large value. This will allow your network stack to better deal with packet bursts.
  • Read packets in a loop, i.e. once your UDP socket is readable (assuming it's non-blocking), read until you get EWOULDBLOCK. This would save you a bunch of select() calls.
  • See if you can use some advanced Windows API along the lines of Linux recvmmsg(2), if such a thing exists, to dequeue multiple packets per syscall.
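The first three suggestions could look roughly like this in Python 3. This is a sketch under assumptions, not the asker's code: `enlarge_receive_buffer`, `drain` and `StaggeredPinger` are hypothetical names, the 1 MiB buffer size is an arbitrary pick, and the OS is free to clamp or adjust the size it actually grants.

```python
import socket
import time


def enlarge_receive_buffer(sock, size=1 << 20):
    """Request a bigger SO_RCVBUF and return the size the OS reports back."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, size)
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)


def drain(sock, handle, bufsize=1024):
    """Read from a non-blocking socket until it would block.

    Calls handle(data, addr) per datagram and returns how many were read.
    """
    count = 0
    while True:
        try:
            data, addr = sock.recvfrom(bufsize)
        except BlockingIOError:  # Python's exception for EWOULDBLOCK / EAGAIN
            return count
        handle(data, addr)
        count += 1


class StaggeredPinger:
    """Hand out one target at a time so pings are spread over the period."""

    def __init__(self, targets, period=20.0):
        self.targets = list(targets)
        # With N targets and a 20 s period, one ping is due every 20/N seconds.
        self.interval = period / max(len(self.targets), 1)
        self.index = 0
        self.next_due = float("-inf")  # first target is due immediately
        self.last_sent = {}            # target -> time its last ping was issued

    def due_target(self, now=None):
        """Return the next target to ping, or None if nothing is due yet."""
        now = time.monotonic() if now is None else now
        if not self.targets or now < self.next_due:
            return None
        target = self.targets[self.index]
        self.index = (self.index + 1) % len(self.targets)
        self.next_due = now + self.interval
        self.last_sent[target] = now
        return target
```

In a monitor loop, each pass would call `drain()` whenever select() reports the socket readable and then ask `due_target()` whether a single ping should go out, so replies arrive as a trickle rather than as one burst every 20 seconds.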

Hope this helps.

Nikolai Fetissov answered Oct 24 '22 05:10