Memory Leak while parsing html page source with BeautifulSoup & Requests

The basic idea is to make GET requests to a list of URLs and extract the text from each page source by removing HTML tags and scripts with BeautifulSoup. The Python version is 2.7.

The problem is that the parser function's memory usage keeps growing with every request; the process size increases gradually.

from bs4 import BeautifulSoup

def get_text_from_page_source(page_source):
    # page_source is the raw HTML (e.g. response.content), so parse it
    # directly; open() would treat it as a file path and leave a handle open
    soup = BeautifulSoup(page_source, 'html.parser')
    # soup = BeautifulSoup(page_source, "lxml")
    # kill all script and style elements
    for script in soup(["script", "style"]):
        script.decompose()    # rip it out
    # get text
    text = soup.get_text()
    # break into lines and remove leading and trailing space on each
    lines = (line.strip() for line in text.splitlines())
    # break multi-headlines into a line each
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # drop blank lines
    text = '\n'.join(chunk for chunk in chunks if chunk)

    # print text
    return text

Memory also leaks when parsing a local text file. For example:

import requests

# request 1
response = requests.get(url, timeout=timeout)
parsed_string_from_html_source = get_text_from_page_source(response.content)  # process memory: 100 MB

# request 2
response = requests.get(url, timeout=timeout)
parsed_string_from_html_source = get_text_from_page_source(response.content)  # process memory: 150 MB

# request 3
response = requests.get(url, timeout=timeout)
parsed_string_from_html_source = get_text_from_page_source(response.content)  # process memory: 300 MB
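
To check whether the growth comes from Python objects that stay alive between calls, one rough approach is to count the objects tracked by the garbage collector after each call. The sketch below is only an illustration: `object_growth` is a hypothetical helper name, not part of the original code, and it sees only Python container objects, not memory held inside C extensions or by the allocator.

```python
import gc


def object_growth(func, args_list):
    """Call func once per argument tuple and record how many
    gc-tracked objects are alive after each call."""
    counts = []
    for args in args_list:
        func(*args)
        gc.collect()  # drop unreachable cycles before counting
        counts.append(len(gc.get_objects()))
    return counts
```

If the counts stay flat while the process size still grows, the memory is probably not being held by reachable Python objects. Calling it as `object_growth(get_text_from_page_source, [(html,), (html,), (html,)])` with the same HTML each time would give three counts to compare.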

asked Aug 17 '18 11:08

wizard

1 Answer

You can try closing the response and calling the garbage collector explicitly:

import gc
response.close()
response = None
gc.collect()

This related question might also help: "Python high memory usage with BeautifulSoup"
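
A minimal sketch of how the close-and-collect steps above could be applied once per URL. The name `fetch_and_extract` and its `fetch` / `extract_text` parameters are assumptions made for illustration; they let the loop be tried without a network connection. Whether this actually reduces the memory growth depends on where the memory is being held.

```python
import gc


def fetch_and_extract(urls, fetch, extract_text):
    """Yield (url, text) pairs, releasing each response before the next request.

    fetch(url) must return an object with a .content attribute and a
    .close() method; extract_text(content) must return the parsed text.
    """
    for url in urls:
        response = fetch(url)
        try:
            text = extract_text(response.content)
        finally:
            response.close()  # release the connection even if parsing fails
        response = None       # drop the last reference to the response body
        gc.collect()          # collect any reference cycles left by the parser
        yield url, text
```

With the code from the question, `fetch` could be `lambda u: requests.get(u, timeout=timeout)` and `extract_text` could be `get_text_from_page_source`. Because the function is a generator, only one page's text is held at a time unless the caller stores the results.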

answered Nov 15 '22 07:11

Nuts


Tags: python, memory-leaks, beautifulsoup, python-requests