I am trying to process several web pages with BeautifulSoup4 in Python 2.7.3, but after every parse the memory usage keeps climbing.
This simplified code produces the same behavior:
from bs4 import BeautifulSoup

def parse():
    f = open("index.html", "r")
    page = BeautifulSoup(f.read(), "lxml")
    f.close()

while True:
    parse()
    raw_input()
After calling parse() five times, the Python process already uses 30 MB of memory (the HTML file used was around 100 kB), and it grows by 4 MB with every call. Is there a way to free that memory, or some kind of workaround?
Update: This behavior is giving me headaches. The following code easily uses up plenty of memory, even though the BeautifulSoup variable should be long gone:
from bs4 import BeautifulSoup
import threading, httplib, gc

class pageThread(threading.Thread):
    def run(self):
        con = httplib.HTTPConnection("stackoverflow.com")
        con.request("GET", "/")
        res = con.getresponse()
        if res.status == 200:
            page = BeautifulSoup(res.read(), "lxml")
        con.close()

def load():
    t = list()
    for i in range(5):
        t.append(pageThread())
        t[i].start()
    for thread in t:
        thread.join()

while not raw_input("load? "):
    gc.collect()
    load()
Could this be some kind of bug?
I know this is an old thread, but there's one more thing to keep in mind when parsing pages with BeautifulSoup: when navigating the tree and storing a specific value, make sure you store the string and not a bs4 object. For instance, this caused a memory leak when used in a loop:
category_name = table_data.find('a').contents[0]
This can be fixed by changing it to:
category_name = str(table_data.find('a').contents[0])
In the first example, the type of category_name is bs4.element.NavigableString, which keeps a reference back to the entire parse tree, so holding on to it keeps the whole tree alive.
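A minimal self-contained sketch of the difference (the markup here is made up purely for illustration):

from bs4 import BeautifulSoup

soup = BeautifulSoup("<table><tr><td><a>Books</a></td></tr></table>", "lxml")
table_data = soup.find("td")

leaky = table_data.find("a").contents[0]      # bs4.element.NavigableString
safe = str(table_data.find("a").contents[0])  # plain str, no tree reference

print type(leaky)   # <class 'bs4.element.NavigableString'>
print type(safe)    # <type 'str'>
print leaky.parent  # the NavigableString can still reach the tree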
Try Beautiful Soup's decompose functionality, which destroys the tree when you're done working with each file:
from bs4 import BeautifulSoup

def parse():
    f = open("index.html", "r")
    page = BeautifulSoup(f.read(), "lxml")
    # page extraction goes here
    page.decompose()
    f.close()

while True:
    parse()
    raw_input()
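As a sketch of what that extraction step might look like (the title lookup here is just an illustrative placeholder, not something from the original question), the point is to copy values out as plain strings before destroying the tree:

from bs4 import BeautifulSoup

def parse():
    f = open("index.html", "r")
    page = BeautifulSoup(f.read(), "lxml")
    f.close()
    # Illustrative placeholder: convert to a plain string so nothing
    # keeps a reference to the tree after decompose().
    title = str(page.title.string) if page.title else None
    page.decompose()  # destroys the tree and frees its nodes
    return title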
Try garbage collecting:
from bs4 import BeautifulSoup
import gc

def parse():
    f = open("index.html", "r")
    page = BeautifulSoup(f.read(), "lxml")
    f.close()
    page = None
    gc.collect()

while True:
    parse()
    raw_input()
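If you want to check whether the collector is actually finding anything, gc.collect() returns the number of unreachable objects it found, so you can print it on every iteration (a small diagnostic sketch, not part of the original answer):

import gc

# gc.collect() returns how many unreachable objects it found, which is
# a quick way to see whether the collector is doing any work at all.
unreachable = gc.collect()
print "collected %d unreachable objects" % unreachable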
See also: Python garbage collection