
Memory Leak while parsing html page source with BeautifulSoup & Requests

The basic idea is to make a GET request to each URL in a list and extract the text from each page source by removing HTML tags and scripts with BeautifulSoup. Python version is 2.7.

The problem: with every request, the parser function's memory usage keeps growing; the process size increases gradually.

def get_text_from_page_source(page_source):
    # page_source is an HTML string (e.g. response.content), so it is passed
    # to BeautifulSoup directly -- wrapping it in open() would treat it as a filename
    soup = BeautifulSoup(page_source, 'html.parser')
#     soup = BeautifulSoup(page_source,"lxml")
    # kill all script and style elements
    for script in soup(["script", "style"]):
        script.decompose()    # rip it out
    # get text
    text = soup.get_text()
    # break into lines and remove leading and trailing space on each
    lines = (line.strip() for line in text.splitlines())
    # break multi-headlines into a line each
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # drop blank lines
    text = '\n'.join(chunk for chunk in chunks if chunk)

    # print text
    return text
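One common mitigation (not mentioned in the original post) is to explicitly tear down the parse tree once the text has been extracted: BeautifulSoup's `Tag.decompose()` destroys a tag and its children, and calling it on the whole soup breaks the many cyclic parent/child references the tree holds. A sketch of the same function with that cleanup added:

```python
from bs4 import BeautifulSoup

def get_text_from_page_source(page_source):
    """Extract visible text from an HTML string, then free the parse tree."""
    soup = BeautifulSoup(page_source, 'html.parser')
    # Remove script and style elements so their contents don't end up in the text.
    for tag in soup(["script", "style"]):
        tag.decompose()
    text = soup.get_text()
    # Strip each line, split double-spaced multi-headlines, and drop blanks.
    lines = (line.strip() for line in text.splitlines())
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    text = '\n'.join(chunk for chunk in chunks if chunk)
    # Explicitly destroy the soup's object graph; the tree's cyclic
    # parent/child references otherwise wait on the cycle collector.
    soup.decompose()
    return text
```

Whether this fully cures the growth reported above depends on what else holds references, but it ensures the tree itself is reclaimable as soon as the function returns.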

The memory leak occurs even when parsing a local text file. For example:

#request 1
response = requests.get(url,timeout=timeout)
parsed_string_from_html_source = get_text_from_page_source(response.content) #100 MB

#request 2
response = requests.get(url,timeout=timeout)
parsed_string_from_html_source = get_text_from_page_source(response.content) #150 MB
#request 3
response = requests.get(url,timeout=timeout)
parsed_string_from_html_source = get_text_from_page_source(response.content) #300 MB

Asked by wizard on Aug 17 '18 11:08
People also ask

Can BeautifulSoup parse HTML?

The HTML content of the webpages can be parsed and scraped with Beautiful Soup.

Can BeautifulSoup handle broken HTML?

It is not a real HTML parser but uses regular expressions to dive through tag soup. It is therefore more forgiving in some cases and less good in others. It is not uncommon that lxml/libxml2 parses and fixes broken HTML better, but BeautifulSoup has superior support for encoding detection.

Is BeautifulSoup library is used to parse the document and for extracting HTML documents?

BeautifulSoup is a Python library for parsing HTML and XML documents. It is often used for web scraping. BeautifulSoup transforms a complex HTML document into a complex tree of Python objects, such as tag, navigable string, or comment.


1 Answer

You can try calling the garbage collector after each request:

import gc
response.close()
response = None
gc.collect()

This related question might also help: Python high memory usage with BeautifulSoup
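Putting these suggestions together, a per-request cleanup loop might look like the sketch below; `fetch_and_parse` and the `urls` argument are illustrative names, not from the original post:

```python
import gc

import requests
from bs4 import BeautifulSoup

def fetch_and_parse(urls, timeout=10):
    """Fetch each URL, extract its visible text, and clean up between requests."""
    texts = []
    for url in urls:
        response = requests.get(url, timeout=timeout)
        soup = BeautifulSoup(response.content, 'html.parser')
        # Drop script/style contents before extracting text.
        for tag in soup(["script", "style"]):
            tag.decompose()
        texts.append(soup.get_text())
        # Destroy the parse tree, close the connection, and drop references
        # before forcing a collection pass.
        soup.decompose()
        response.close()
        del soup, response
        gc.collect()
    return texts
```

Forcing `gc.collect()` on every iteration trades a little CPU for promptly reclaiming the soup's reference cycles, which is usually the right trade in a long-running scraping loop.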

Answered by Nuts on Nov 15 '22 07:11