I have a problem with something and I'm guessing it's the code.
The application is used to 'ping' some custom made network devices to check if they're alive. It pings them every 20 seconds with a special UDP packet and expects a response. If they fail to answer 3 consecutive pings the application sends a warning message to the staff.
The application runs 24/7, and a random number of times a day (mostly 2-5) it stops receiving UDP packets for exactly 10 minutes, after which everything goes back to normal. During those 10 minutes only 1 device seems to be replying; the others seem dead. That much I've been able to deduce from the logs.
I've used Wireshark to sniff the packets and verified that ping packets are going both out AND in, so the network part seems to be working okay, all the way to the OS. The computers are running WinXP Pro and some have no firewall configured whatsoever. I'm seeing this issue on different computers, different Windows installs and different networks.
I'm really at a loss as to what might be the problem here.
I'm attaching the relevant part of the code, which does all the networking. It runs in a separate thread from the rest of the application.
I thank you in advance for whatever insight you might provide.
    def monitor(self):
        checkTimer = time()
        while self.running:
            read, write, error = select.select([self.commSocket],[self.commSocket],[],0)
            if self.commSocket in read:
                try:
                    data, addr = self.commSocket.recvfrom(1024)
                    self.processInput(data, addr)
                except:
                    pass

            if time() - checkTimer > 20: # every 20 seconds
                checkTimer = time()
                if self.commSocket in write:
                    for rtc in self.rtcList:
                        try:
                            addr = (rtc, 7) # port 7 is the echo port
                            self.commSocket.sendto('ping',addr)
                            if not self.rtcCheckins[rtc][0]: # if last check was a failure
                                self.rtcCheckins[rtc][1] += 1 # incr failure count
                            self.rtcCheckins[rtc][0] = False # setting last check to failure
                        except:
                            pass

            for rtc in self.rtcList:
                if self.rtcCheckins[rtc][1] > 2: # didn't answer for a whole minute
                    self.rtcCheckins[rtc][1] = 0
                    self.sendError(rtc)
You don't mention it, so I have to remind you that since you are using select(), that socket had better be non-blocking. Otherwise your recvfrom() can block. That should not really happen when select() is used properly, but it is hard to tell from the short code snippet.
Also, you don't have to check a UDP socket for writability: it is always writable.
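A minimal sketch of those two points, assuming Python 3 (the code in the question looks like Python 2) and a socket bound to loopback purely for demonstration. The helper names make_monitor_socket and wait_readable are hypothetical, not part of the original code.

```python
import select
import socket


def make_monitor_socket():
    """Create a non-blocking UDP socket (hypothetical helper)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    # Assumption: loopback address and an OS-chosen port, for demo only.
    sock.bind(("127.0.0.1", 0))
    # With blocking disabled, recvfrom() raises BlockingIOError
    # instead of stalling the thread when nothing is queued.
    sock.setblocking(False)
    return sock


def wait_readable(sock, timeout=0.5):
    """Return True if at least one datagram is waiting to be read.

    Only the read set is passed to select(); the write set is left
    empty because the UDP socket is not checked for writability.
    """
    readable, _, _ = select.select([sock], [], [], timeout)
    return sock in readable
```

Passing a non-zero timeout is a design choice in this sketch: with a timeout of 0, select() returns immediately, so a surrounding while loop spins without ever sleeping.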
Now for the real problem - you are saying that packets are entering the system, but your code does not receive them. This is most probably due to the overflow of the socket receive buffer. Did the number of ping targets increase over those last 15 years? You are setting yourself up for a ping-response storm, and probably not reading those responses fast enough, so they pile up in the receive buffer and eventually get dropped.
My suggestions in order of ROI:
- Set SO_RCVBUF to a large value. This will allow your network stack to better deal with packet bursts.
- Once the socket is readable, read packets in a loop until you get EWOULDBLOCK. This would save you a bunch of select() calls.
- Use something along the lines of recvmmsg(2), if such a thing exists on your platform, to dequeue multiple packets per syscall.

Hope this helps.
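The first two suggestions can be sketched as follows, assuming Python 3 and a socket that is already non-blocking. The function names enlarge_receive_buffer and drain, the 1 MiB default, and the handle callback are illustrative assumptions. In Python 3, EWOULDBLOCK surfaces as BlockingIOError.

```python
import socket


def enlarge_receive_buffer(sock, size=1 << 20):
    """Ask the OS for a bigger receive buffer; return the size it reports."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, size)
    # The OS may clamp or adjust the request, so read the value back.
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)


def drain(sock, handle, bufsize=1024):
    """Read every queued datagram from a non-blocking socket.

    Stops when recvfrom() would block (EWOULDBLOCK), which Python 3
    raises as BlockingIOError. Returns the number of datagrams read.
    """
    count = 0
    while True:
        try:
            data, addr = sock.recvfrom(bufsize)
        except BlockingIOError:
            return count  # receive queue is empty for now
        handle(data, addr)
        count += 1
```

In the monitor loop from the question, a call such as drain(self.commSocket, self.processInput) could replace the single recvfrom() once select() reports the socket as readable, so that one wake-up empties the whole queue.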