I have a Python script that makes a number of HTTP and urllib requests to various domains.
We have a huge number of domains to process and need to do this as quickly as possible. As HTTP requests are slow (i.e. they can time out if there is no website on the domain), I run a number of copies of the script at any one time, feeding them from a domains list in the database.
The problem I see is that over a period of time (a few hours to 24 hours) the scripts all start to slow down, and ps -al shows they are sleeping.
The servers are very powerful (8 cores, 72GB RAM, 6TB RAID 6, etc., 80MB 2:1 connection) and are never maxed out; for example, free -m shows:
-/+ buffers/cache: 61157 11337
Swap: 4510 195 4315
top shows 80-90% idle
sar -d shows an average of 5.3% utilisation
and, more interestingly, iptraf starts off at around 50-60MB/s and ends up at 8-10MB/s after about 4 hours.
I am currently running around 500 copies of the script on each server (2 servers) and they both show the same problem.
ps -al shows that most of the Python scripts are sleeping, which I don't understand; for instance:
0 S 0 28668 2987 0 80 0 - 71003 sk_wai pts/2 00:00:03 python
0 S 0 28669 2987 0 80 0 - 71619 inet_s pts/2 00:00:31 python
0 S 0 28670 2987 0 80 0 - 70947 sk_wai pts/2 00:00:07 python
0 S 0 28671 2987 0 80 0 - 71609 poll_s pts/2 00:00:29 python
0 S 0 28672 2987 0 80 0 - 71944 poll_s pts/2 00:00:31 python
0 S 0 28673 2987 0 80 0 - 71606 poll_s pts/2 00:00:26 python
0 S 0 28674 2987 0 80 0 - 71425 poll_s pts/2 00:00:20 python
0 S 0 28675 2987 0 80 0 - 70964 sk_wai pts/2 00:00:01 python
0 S 0 28676 2987 0 80 0 - 71205 inet_s pts/2 00:00:19 python
0 S 0 28677 2987 0 80 0 - 71610 inet_s pts/2 00:00:21 python
0 S 0 28678 2987 0 80 0 - 71491 inet_s pts/2 00:00:22 python
There is no sleep call in the script that gets executed, so I can't understand why ps -al shows most of them asleep, and why they should get slower and slower, making fewer HTTP requests over time, when CPU, memory, disk access and bandwidth are all available in abundance.
If anyone could help I would be very grateful.
EDIT:
The code is massive as I am using exceptions throughout it to catch diagnostics about the domain, i.e. reasons I can't connect. I will post the code somewhere if needed, but the fundamental calls via httplib and urllib are straight from the Python examples.
More info:
Both quota -u mysql and quota -u root come back with nothing.
ulimit -n comes back with 1024. I have changed limits.conf to allow mysql 16000 soft and hard connections, and I am able to run over 2000 scripts so far, but I still see the problem.
OK, so I have changed all the limits for the user and ensured all sockets are closed (they were not), and although things are better, I am still getting a slowdown, although not as bad.
Interestingly, I have also noticed a memory leak - the scripts use more and more memory the longer they run; however, I am not sure what is causing this. I store output data in a string and then print it to the terminal after every iteration, and I do clear the string at the end too, but could the ever-increasing memory be down to the terminal storing all the output?
Edit: No, it seems not - I ran 30 scripts without outputting to the terminal and still saw the same leak. I'm not using anything clever (just strings, httplib and urllib) - I wonder if there are any issues with the Python MySQL connector...?
Check the ulimit and quota for the box and the user running the scripts. /etc/security/limits.conf may also contain resource restrictions that you might want to modify.
ulimit -n will show the max number of open file descriptors allowed.
You can also check the fds with ls -l /proc/[PID]/fd/, where [PID] is the process id of one of the scripts.
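If you would rather watch this from inside Python than from the shell, here is a minimal sketch (Linux-only, assuming /proc is mounted; count_open_fds is just an illustrative helper name):

import os

def count_open_fds(pid='self'):
    # Each entry under /proc/<pid>/fd is one open descriptor
    # (regular files, sockets, pipes, ...).
    return len(os.listdir('/proc/%s/fd' % pid))

print count_open_fds()       # this process
print count_open_fds(28668)  # e.g. one of the script PIDs from ps -al

If that number keeps climbing while a script runs, something is not being closed.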
I would need to see some code to tell what's really going on.
Edit (Importing comments and more troubleshooting ideas):
Can you show the code where you're opening and closing the connections?
When just a few script processes are running, do they too start to go idle after a while? Or does it only happen when there are several hundred or more running at once?
Is there a single parent process that starts all of these scripts?
If you're using s = urllib2.urlopen(someURL), make sure to s.close() when you're done with it. Python can often close things down for you (like if you're doing x = urllib2.urlopen(someURL).read()), but it will leave that to you when told to (such as when you assign a variable to the return value of .urlopen()). Double check your opening and closing of urllib calls (or all I/O code, to be safe). If each script is designed to have only 1 open socket at a time, and your /proc/PID/fd is showing multiple active/open sockets per script process, then there is definitely a code issue to fix.
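One way to guarantee the close happens even when an exception is raised mid-read is contextlib.closing; a sketch (Python 2.6+, with someURL standing in for one of your domains):

import contextlib
import urllib2

someURL = 'http://example.com/'  # stand-in for one of your domains
with contextlib.closing(urllib2.urlopen(someURL)) as s:
    data = s.read()
# s.close() has been called by this point, even if read() raised.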
ulimit -n showing 1024 means that is the limit of open sockets/fds that the mysql user can have. You can change this with ulimit -S -n [LIMIT_#], but check out this article first:
Changing process.max-file-descriptor using 'ulimit -n' can cause MySQL to change table_open_cache value.
You may need to log out and shell back in afterwards, and/or add it to /etc/bashrc (don't forget to source /etc/bashrc if you change bashrc and don't want to log out/in).
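You can also inspect (and, within the hard limit, raise) the descriptor limit from inside the script itself with the resource module; a sketch, assuming the 16000 figure you mentioned:

import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print 'open file limit: soft=%d hard=%d' % (soft, hard)

# Any process may raise its soft limit up to the hard limit;
# raising the hard limit itself requires root.
resource.setrlimit(resource.RLIMIT_NOFILE, (min(16000, hard), hard))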
Disk space is another thing that I have found out (the hard way) can cause very weird issues. I have had processes act like they are running (not zombied) but not doing what is expected because they had open handles to a log file on a partition with zero disk space left.
netstat -anpTee | grep -i mysql will also show whether these sockets are connected/established/waiting to be closed/waiting on timeout/etc.
watch -n 0.1 'netstat -anpTee | grep -i mysql' lets you see the sockets open/close/change state/etc. in real time in a nice table output (you may need to export GREP_OPTIONS= first if you have it set to something like --color=always).
lsof -u mysql or lsof -U will also show you open FDs (the output is quite verbose).
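If you would rather collect those numbers from a monitoring script than from a terminal, a rough sketch wrapping lsof (assumes Python 2.7 for subprocess.check_output; open_fd_count is just an illustrative name):

import subprocess

def open_fd_count(user='mysql'):
    # lsof prints one line per open descriptor, plus a header line.
    try:
        out = subprocess.check_output(['lsof', '-u', user])
    except subprocess.CalledProcessError:
        return 0  # lsof exits non-zero if it finds nothing for that user
    return max(len(out.splitlines()) - 1, 0)

print 'mysql user currently holds %d open fds' % open_fd_count()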
import urllib2
import socket

socket.setdefaulttimeout(15)
# or setdefaulttimeout(0) for non-blocking sockets:
# In non-blocking mode (blocking is the default), if a recv() call
# doesn't find any data, or if a send() call can't
# immediately dispose of the data,
# an error exception is raised.
# ......
try:
    s = urllib2.urlopen(some_url)
    # do stuff with s like s.read(), s.headers, etc..
except (urllib2.HTTPError, urllib2.URLError):
    # myLogger.exception("Error opening: %s!", some_url)
    pass
finally:
    try:
        s.close()
        # del s - although, I don't know if deleting s will help things any.
    except Exception:
        pass
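On Python 2.6 and later you can also pass the timeout per call instead of setting a global default, e.g.:

s = urllib2.urlopen(some_url, timeout=15)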
Some man pages and reference links:
- ulimit
- quota
- limits.conf
- fork bomb
- Changing process.max-file-descriptor using 'ulimit -n' can cause MySQL to change table_open_cache value
- python socket module
- lsof