I wrote about 50 classes that I use to connect to and work with websites using mechanize and threading. They all run concurrently, but they don't depend on each other. So that means 1 class - 1 website - 1 thread. It's not a particularly elegant solution, especially for managing the code, since a lot of the code repeats in each class (but not nearly enough to merge it into one class that takes arguments, as some sites may require additional processing of retrieved data in the middle of methods, like 'login', that others might not need). As I said, it's not elegant, but it works. Needless to say, I welcome all recommendations on how to write this better without the 1-class-per-website approach. Adding functionality to each class, or managing the code overall, is a daunting task.
However, I found out that each thread takes about 8MB of memory, so with 50 running threads we are looking at about 400MB of usage. If it was running on my own system I wouldn't have a problem with that, but since it's running on a VPS with only 1GB of memory, it's starting to be an issue. Can you tell me how to reduce the memory usage, or is there any other way to work with multiple sites concurrently?
I used this quick Python test program to check whether it's the data stored in my application's variables that is using the memory, or something else. As you can see in the following code, it only calls the sleep() function, yet each thread is using 8MB of memory.
from thread import start_new_thread
from time import sleep

def sleeper():
    try:
        while 1:
            sleep(10000)
    except:
        if running: raise

def test():
    global running
    n = 0
    running = True
    try:
        while 1:
            start_new_thread(sleeper, ())
            n += 1
            if not (n % 50):
                print n
    except Exception, e:
        running = False
        print 'Exception raised:', e
    print 'Biggest number of threads:', n

if __name__ == '__main__':
    test()
When I run this, the output is:
50
100
150
Exception raised: can't start new thread
Biggest number of threads: 188
And by removing the running = False line, I can then measure free memory using the free -m command in the shell:
             total       used       free     shared    buffers     cached
Mem:          1536       1533          2          0          0          0
-/+ buffers/cache:       1533          2
Swap:            0          0          0
The calculation behind the figure of about 8MB per thread is simple: I took the difference between the memory used before the test and while it was running, and divided it by the maximum number of threads it managed to start.
It's probably only allocated, not actually used, memory, because according to top the python process uses only about 0.6% of memory.
Using "one thread per request" is OK and easy for many use cases. However, it requires a lot of resources (as you experienced).
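If you stay with threads for now, one knob that may help is the per-thread stack size. On Linux the stack reserved for each new thread commonly follows the `ulimit -s` setting, which is often 8 MB and would match the figure in the question; that this applies to the VPS in question is an assumption. The sketch below uses `threading.stack_size()` to request a smaller stack before the threads are created. The 256 KiB value and the `run_workers` helper are illustrative choices, not a recommendation for any specific workload, and some platforms may refuse to change the size.

```python
import threading

# Assumed value: 256 KiB per thread. Deeply recursive code may need more.
SMALL_STACK = 256 * 1024


def run_workers(count, stack_bytes=SMALL_STACK):
    """Start `count` threads with a reduced stack size and wait for them."""
    # stack_size() only affects threads created after the call and
    # returns the previous setting (0 means "platform default").
    previous = threading.stack_size(stack_bytes)
    results = []
    lock = threading.Lock()

    def worker(index):
        # Hypothetical placeholder for the per-website work.
        with lock:
            results.append(index)

    try:
        threads = [threading.Thread(target=worker, args=(i,))
                   for i in range(count)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
    finally:
        # Restore the previous setting for any threads created later.
        threading.stack_size(previous)
    return sorted(results)
```

Calling `run_workers(50)` starts 50 threads that each request the smaller stack. Whether this lowers the numbers reported by `free -m` depends on how the system accounts for reserved but untouched stack memory.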
A better approach is an asynchronous one, but unfortunately it is a lot more complex.
Some hints in this direction:
The solution is to replace code like this:
1) Do something.
2) Wait for something to happen.
3) Do something else.
With code like this:
1) Do something.
2) Arrange it so that when something happens, something else gets done.
3) Done.
Somewhere else, you have a few threads that do this:
1) Wait for anything to happen.
2) Handle whatever happened.
3) Go to step 1.
In the first case, if you're waiting for 50 things to happen, you have 50 threads sitting around waiting for 50 things to happen. In the second case, you have one thread waiting around that will do whichever of those 50 things need to get done.
So, don't use a thread to wait for a single thing to happen. Instead, arrange it so that when that thing happens, some other thread will do whatever needs to get done next.
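The pattern described above can be sketched with an event loop. This is a minimal illustration, assuming Python 3 and its standard `asyncio` module rather than the Python 2 and mechanize setup from the question; `handle_site` and the site names are hypothetical, and `asyncio.sleep` stands in for waiting on the network.

```python
import asyncio


async def handle_site(name, delay):
    # Hypothetical stand-in for "log in, fetch, process".
    # Awaiting hands control back to the loop instead of blocking a thread.
    await asyncio.sleep(delay)
    return name + ": done"


async def run_all(site_names):
    # One coroutine per site; a single thread waits on all of them at once.
    tasks = [handle_site(name, 0.01) for name in site_names]
    # gather() returns results in the same order as the inputs.
    return await asyncio.gather(*tasks)


def main():
    sites = ["site%d" % i for i in range(50)]
    return asyncio.run(run_all(sites))
```

Here one thread services all 50 waits. A blocking library such as mechanize would not fit this loop as-is; the network calls would have to go through a non-blocking client for the approach to pay off.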