I built a scraper (worker) that I launch XX times through multithreading (in a Jupyter Notebook, Python 2.7, Anaconda). The script follows the format described on python.org:
from Queue import Queue  # Python 2.7; the module is named queue in Python 3
from threading import Thread

def worker():
    while True:
        item = q.get()
        do_work(item)
        q.task_done()

q = Queue()
for i in range(num_worker_threads):
    t = Thread(target=worker)
    t.daemon = True
    t.start()

for item in source():
    q.put(item)

q.join()  # block until all tasks are done
When I run the script as-is, there are no issues: memory is released after the script finishes.
However, I want to run the script 20 times (a sort of batching), so I turned it into a function and ran that function with the code below:
def multithreaded_script():
    ...  # the script from above

x = 0
while x < 20:
    x += 1
    multithreaded_script()
Memory builds up with each iteration, and eventually the system starts writing it to disk.
Is there a way to clear out the memory after each run?
I tried:
sleep(30)
at the end of each iteration (in case it takes time for the RAM to be released), and nothing seems to help. Any ideas on what else I can try to get the memory to clear out after each run within the while loop? If not, is there a better way to execute my script XX times that won't eat up the RAM?
Thank you in advance.
Memory management: unlike many other languages, Python does not necessarily release memory back to the operating system. Instead, it has a dedicated object allocator for objects smaller than 512 bytes, which keeps some chunks of already-allocated memory for further use in the future.
CPython does not support true multi-core execution via multithreading: the Global Interpreter Lock (GIL) lets only one thread execute Python bytecode at a time. Python does have a threading library, and the GIL does not prevent threading as such; threads simply won't run CPU-bound Python code in parallel.
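A quick standalone demo of that limitation (not from the original post): two CPU-bound threads take roughly as long as running the same work serially, because only one thread can hold the GIL at a time.

import time
from threading import Thread

def spin(n):
    # Pure-Python CPU-bound loop; only one thread can execute
    # this bytecode at a time because of the GIL.
    while n > 0:
        n -= 1

start = time.time()
threads = [Thread(target=spin, args=(10**7,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("two CPU-bound threads: %.2fs" % (time.time() - start))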
Processes each have their own memory pool. This means it is slow to copy large amounts of data into them or out of them, for example when running functions on large input arrays or DataFrames. Threads share the same memory as the main Python session, so there is no need to copy data to or from them.
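That separate memory pool is also what lets a process give memory back to the OS when it exits. One common workaround for the build-up described in the question (a sketch, not the accepted answer below; it assumes multithreaded_script is defined at module level) is to run each batch in a short-lived child process:

from multiprocessing import Process

# Each batch runs in its own process; when the child exits,
# the operating system reclaims all of its memory.
# (On Windows, wrap this loop in an `if __name__ == '__main__':` guard.)
for _ in range(20):
    p = Process(target=multithreaded_script)
    p.start()
    p.join()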
Debugging: once you detect an unusual memory consumption pattern in your app, the next step is to debug it to locate the cause. Python's built-in garbage collector lets you do this: you can view a list of the objects in memory that the garbage collector is aware of via gc.get_objects().
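A minimal sketch of that kind of inspection, using only the standard library:

import gc
from collections import Counter

gc.collect()  # run a collection first so the snapshot is current

# Tally live objects by type; a type whose count keeps growing
# between batches is a good leak suspect.
counts = Counter(type(obj).__name__ for obj in gc.get_objects())
for name, n in counts.most_common(10):
    print("%s: %d" % (name, n))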
TL;DR Solution: make sure to end each function with return, to ensure all local variables are destroyed from RAM.
Per Pavel's suggestion, I used a memory tracker (unfortunately, the suggested memory tracker didn't work for me, so I used Pympler).
Implementation was fairly simple:
from pympler.tracker import SummaryTracker

tracker = SummaryTracker()

# ~~~~~~~~~ YOUR CODE ~~~~~~~~~

tracker.print_diff()
The tracker gave a nice output, which made it obvious that local variables generated by functions were not being destroyed.
Adding "return" at the end of every function fixed the issue.
Takeaway:
If you are writing a function that processes info and generates local variables, but doesn't pass those local variables on to anything else, make sure to end the function with return anyway. This will prevent the kind of memory issue you may otherwise run into, as sketched below.
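A tiny sketch of the pattern (the fetch/parse/save helpers here are placeholders, not from the original script):

def process_item(item):
    html = fetch_page(item)    # placeholder: your download step
    record = parse_html(html)  # placeholder: your parsing step
    save(record)               # placeholder: your output step
    return  # bare return at the end, per the takeaway above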
Additional notes on memory usage & BeautifulSoup:
If you are using BeautifulSoup / bs4 with multithreading and multiple workers, and you have a limited amount of free RAM, you can also call soup.decompose() to destroy the soup variable right after you are done with it, instead of waiting for the function to return / the code to stop running.
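For example, a worker's parse step might look like this (a sketch; html stands for whatever the worker downloaded):

from bs4 import BeautifulSoup

def extract_links(html):
    soup = BeautifulSoup(html, "html.parser")
    links = [a.get("href") for a in soup.find_all("a")]
    soup.decompose()  # tear down the parse tree immediately
    return links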