Python - save requests or BeautifulSoup object locally

I have some code that is quite long and takes a long time to run. I would like to save either the requests object (in this case "name") or the BeautifulSoup object (in this case "soup") locally, so that the next run does not have to fetch the page again. Here is the code:

from bs4 import BeautifulSoup
import requests

url = 'SOMEURL'
name = requests.get(url)
soup = BeautifulSoup(name.content)

asked May 29 '14 22:05

bill999

2 Answers

Since name.content is just HTML, you can just dump this to a file and read it back later.

Usually the bottleneck is not the parsing, but instead the network latency of making requests.

from bs4 import BeautifulSoup
import requests

url = 'https://google.com'
name = requests.get(url)

# name.content is bytes, so the file must be opened in binary mode
with open("/tmp/A.html", "wb") as f:
  f.write(name.content)


# read it back in
with open("/tmp/A.html", "rb") as f:
  soup = BeautifulSoup(f, "html.parser")
  # do something with soup
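
One way to package the dump-and-reload idea is a small helper that only touches the network when no local copy exists yet. This is a minimal sketch and not part of the original answer: the name load_html, its fetch parameter and the cache path are hypothetical, and fetch is assumed to be any callable that takes a URL and returns the page as bytes.

```python
from pathlib import Path


def load_html(url, cache_path, fetch):
    """Return the page bytes for url, reusing cache_path when it already exists.

    fetch is a hypothetical callable (url -> bytes), for example
    lambda u: requests.get(u).content
    """
    path = Path(cache_path)
    if path.exists():
        # Cache hit: read the saved copy and skip the network entirely.
        return path.read_bytes()
    # Cache miss: download once, then save the raw bytes for the next run.
    html = fetch(url)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(html)
    return html
```

With requests and bs4 imported, a call could look like soup = BeautifulSoup(load_html(url, "/tmp/A.html", lambda u: requests.get(u).content), "html.parser"). Deleting the file forces a fresh download.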

Here is some anecdotal evidence that the bottleneck is the network.

from bs4 import BeautifulSoup
import requests
import time

url = 'https://google.com'

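# time.clock() was removed in Python 3.8; perf_counter() measures elapsed wall-clock time
t1 = time.perf_counter()
name = requests.get(url)
t2 = time.perf_counter()
soup = BeautifulSoup(name.content, "html.parser")
t3 = time.perf_counter()

print(t2 - t1, t3 - t2)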

Output (in seconds) from a run on a ThinkPad X1 Carbon on a fast campus network:

0.11 0.02

answered Oct 04 '22 23:10

merlin2011

Storing requests locally and restoring them as BeautifulSoup objects later on

If you are iterating through the pages of a website, you can store each page with requests as shown here. Create a folder named soupCategory in the same folder as your script.

Use any recent user agent for the headers:

import time

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

headers = {'user-agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0 Safari/605.1.15'}

def getCategorySoup():
    # Session that retries failed connections with exponential backoff
    session = requests.Session()
    retry = Retry(connect=7, backoff_factor=0.5)

    adapter = HTTPAdapter(max_retries=retry)
    session.mount('http://', adapter)
    session.mount('https://', adapter)

    basic_url = "https://www.somescrappingdomain.com/apartments?adsWithImages=1&page="
    t0 = time.time()
    totalPages = 1525 # put your number of pages here
    for i in range(1, totalPages + 1):  # + 1 so the last page is included
        url = basic_url+str(i)
        # go through the session so the retry adapter is actually used
        r = session.get(url, headers=headers)
        pageName = "./soupCategory/"+str(i)+".html"
        with open(pageName, mode='w', encoding='UTF-8', errors='strict', buffering=1) as f:
            f.write(r.text)
            print (pageName, end=" ")
    t1 = time.time()
    total = t1-t0
    print ("Total time for getting ",totalPages," category pages is ", round(total), " seconds")
    return

Later on you can create the BeautifulSoup object, as @merlin2011 mentioned, with:

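# same relative folder and encoding that the pages were written with
with open("./soupCategory/1.html", encoding="UTF-8") as f:
  soup = BeautifulSoup(f, "html.parser")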
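
To restore every saved page rather than only the first, the folder can be walked in page order. This is a rough sketch, assuming the files are named by page number (1.html, 2.html, ...) as in getCategorySoup() above; the helper name saved_pages is made up for illustration. It yields the raw HTML text, so parsing stays a separate step.

```python
from pathlib import Path


def saved_pages(folder="./soupCategory"):
    """Yield (page_number, html_text) for each saved <number>.html file, in numeric order."""
    # Ignore any file whose name is not a plain page number.
    files = [p for p in Path(folder).glob("*.html") if p.stem.isdigit()]
    # Sort numerically so that 10.html comes after 2.html, not before it.
    for path in sorted(files, key=lambda p: int(p.stem)):
        yield int(path.stem), path.read_text(encoding="UTF-8")


# Possible usage, with BeautifulSoup imported from bs4:
# for page, html in saved_pages():
#     soup = BeautifulSoup(html, "html.parser")
```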

answered Oct 04 '22 22:10

Hrvoje

Tags:

python

file

beautifulsoup

scrape