Same thing asked 2.5 years ago in Downloading a web page and all of its resource files in Python but doesn't lead to an answer and the 'please see related topic' isn't really asking the same thing. I want to download everything on a page to make it possible to view it just from the files. The command <blockquote> wget --page-requisites --domains=DOMAIN --no-parent --html-extension --convert-links --restrict-file-names=windows </blockquote> does exactly that I need. However we want to be able to tie it in with other stuff that must be portable, so requires it to be in Python. I've been looking at Beautiful Soup, scrapy, various spiders posted around the place, but these all seem to deal with getting data/links in clever but specific ways. Using these to do what I want seems like it will require a lot of work to deal with finding all of the resources, when I'm sure there must be an easy way. thanks very much

You should be using an appropriate tool for the job at hand. If you want to spider a site and save the pages to disk, Python probably isn't the best choice for that. Open source projects get features when someone needs that feature, and because <code>wget</code> does its job so well, nobody has bothered to try to write a python library to replace it. Considering wget runs on pretty much any platform that has a Python interpreter, is there a reason you can't use wget?

My colleague wrote up this code, lots pieced together from other sources I believe. Might have some specific quirks for our system but it should help anyone wanting to do the same <pre class="prettyprint"><code>""" Downloads all links from a specified location and saves to machine. Downloaded links will only be of a lower level then links specified. To use: python downloader.py link """ import sys,re,os,urllib2,urllib,urlparse tocrawl = set([sys.argv[1]]) # linkregex = re.compile('<a\s*href=[\'|"](.*?)[\'"].*?') linkregex = re.compile('href=[\'|"](.*?)[\'"].*?') linksrc = re.compile('src=[\'|"](.*?)[\'"].*?') def main(): link_list = []##create a list of all found links so there are no duplicates restrict = sys.argv[1]##used to restrict found links to only have lower level link_list.append(restrict) parent_folder = restrict.rfind('/', 0, len(restrict)-1) ##a.com/b/c/d/ make /d/ as parent folder while 1: try: crawling = tocrawl.pop() #print crawling except KeyError: break url = urlparse.urlparse(crawling)##splits url into sections try: response = urllib2.urlopen(crawling)##try to open the url except: continue msg = response.read()##save source of url links = linkregex.findall(msg)##search for all href in source links = links + linksrc.findall(msg)##search for all src in source for link in (links.pop(0) for _ in xrange(len(links))): if link.startswith('/'): ##if /xxx a.com/b/c/ -> a.com/b/c/xxx link = 'http://' + url[1] + link elif ~link.find('#'): continue elif link.startswith('../'): if link.find('../../'):##only use links that are max 1 level above reference ##if ../xxx.html a.com/b/c/d.html -> a.com/b/xxx.html parent_pos = url[2].rfind('/') parent_pos = url[2].rfind('/', 0, parent_pos-2) + 1 parent_url = url[2][:parent_pos] new_link = link.find('/')+1 link = link[new_link:] link = 'http://' + url[1] + parent_url + link else: continue elif not link.startswith('http'): if url[2].find('.html'): ##if xxx.html a.com/b/c/d.html -> a.com/b/c/xxx.html a = url[2].rfind('/')+1 parent = url[2][:a] link = 'http://' + url[1] + parent + link else: ##if xxx.html a.com/b/c/ -> a.com/b/c/xxx.html link = 'http://' + url[1] + url[2] + link if link not in link_list: link_list.append(link)##add link to list of already found links if (~link.find(restrict)): ##only grab links which are below input site print link ##print downloaded link tocrawl.add(link)##add link to pending view links file_name = link[parent_folder+1:]##folder structure for files to be saved filename = file_name.rfind('/') folder = file_name[:filename]##creates folder names folder = os.path.abspath(folder)##creates folder path if not os.path.exists(folder): os.makedirs(folder)##make folder if it does not exist try: urllib.urlretrieve(link, file_name)##download the link except: print "could not download %s"%link else: continue if __name__ == "__main__": main() </code></pre> thanks for the replies

Equivalent of wget in Python to download website and resources

Tags:

python

wget

web-crawler

Same thing asked 2.5 years ago in Downloading a web page and all of its resource files in Python but doesn't lead to an answer and the 'please see related topic' isn't really asking the same thing.

I want to download everything on a page to make it possible to view it just from the files.

The command

wget --page-requisites --domains=DOMAIN --no-parent --html-extension --convert-links --restrict-file-names=windows

does exactly that I need. However we want to be able to tie it in with other stuff that must be portable, so requires it to be in Python.

I've been looking at Beautiful Soup, scrapy, various spiders posted around the place, but these all seem to deal with getting data/links in clever but specific ways. Using these to do what I want seems like it will require a lot of work to deal with finding all of the resources, when I'm sure there must be an easy way.

thanks very much

788

asked Feb 10 '12 00:02

Conrad

2 Answers

You should be using an appropriate tool for the job at hand.

If you want to spider a site and save the pages to disk, Python probably isn't the best choice for that. Open source projects get features when someone needs that feature, and because wget does its job so well, nobody has bothered to try to write a python library to replace it.

Considering wget runs on pretty much any platform that has a Python interpreter, is there a reason you can't use wget?

120

answered Nov 02 '22 01:11

ironchefpython

My colleague wrote up this code, lots pieced together from other sources I believe. Might have some specific quirks for our system but it should help anyone wanting to do the same

"""
    Downloads all links from a specified location and saves to machine.
    Downloaded links will only be of a lower level then links specified.
    To use: python downloader.py link
"""
import sys,re,os,urllib2,urllib,urlparse
tocrawl = set([sys.argv[1]])
# linkregex = re.compile('<a\s*href=[\'|"](.*?)[\'"].*?')
linkregex = re.compile('href=[\'|"](.*?)[\'"].*?')
linksrc = re.compile('src=[\'|"](.*?)[\'"].*?')
def main():
    link_list = []##create a list of all found links so there are no duplicates
    restrict = sys.argv[1]##used to restrict found links to only have lower level
    link_list.append(restrict)
    parent_folder = restrict.rfind('/', 0, len(restrict)-1)
    ##a.com/b/c/d/ make /d/ as parent folder
    while 1:
        try:
            crawling = tocrawl.pop()
            #print crawling
        except KeyError:
            break
        url = urlparse.urlparse(crawling)##splits url into sections
        try:
            response = urllib2.urlopen(crawling)##try to open the url
        except:
            continue
        msg = response.read()##save source of url
        links = linkregex.findall(msg)##search for all href in source
        links = links + linksrc.findall(msg)##search for all src in source
        for link in (links.pop(0) for _ in xrange(len(links))):
            if link.startswith('/'):
                ##if /xxx a.com/b/c/ -> a.com/b/c/xxx
                link = 'http://' + url[1] + link
            elif ~link.find('#'):
                continue
            elif link.startswith('../'):
                if link.find('../../'):##only use links that are max 1 level above reference
                    ##if ../xxx.html a.com/b/c/d.html -> a.com/b/xxx.html
                    parent_pos = url[2].rfind('/')
                    parent_pos = url[2].rfind('/', 0, parent_pos-2) + 1
                    parent_url = url[2][:parent_pos]
                    new_link = link.find('/')+1
                    link = link[new_link:]
                    link = 'http://' + url[1] + parent_url + link
                else:
                    continue
            elif not link.startswith('http'):
                if url[2].find('.html'):
                    ##if xxx.html a.com/b/c/d.html -> a.com/b/c/xxx.html
                    a = url[2].rfind('/')+1
                    parent = url[2][:a]
                    link = 'http://' + url[1] + parent + link
                else:
                    ##if xxx.html a.com/b/c/ -> a.com/b/c/xxx.html
                    link = 'http://' + url[1] + url[2] + link
            if link not in link_list:
                link_list.append(link)##add link to list of already found links
                if (~link.find(restrict)):
                ##only grab links which are below input site
                    print link ##print downloaded link
                    tocrawl.add(link)##add link to pending view links
                    file_name = link[parent_folder+1:]##folder structure for files to be saved
                    filename = file_name.rfind('/')
                    folder = file_name[:filename]##creates folder names
                    folder = os.path.abspath(folder)##creates folder path
                    if not os.path.exists(folder):
                        os.makedirs(folder)##make folder if it does not exist
                    try:
                        urllib.urlretrieve(link, file_name)##download the link
                    except:
                        print "could not download %s"%link
                else:
                    continue
if __name__ == "__main__":
    main()

thanks for the replies

answered Nov 02 '22 00:11

Conrad

Related questions
                            
                                python: urllib2 using different network interface
                            
                                How to do a memset with Python buffer object?
                            
                                Find the largest divisor of N that is less than sqrt(N)
                            
                                How to make Tkinter message expand when I resize the window?
                            
                                How to vectorize the evaluation of bilinear & quadratic forms?
                            
                                Is it possible to concatenate QuerySets?
                            
                                Python Class Variable Initialization
                            
                                Python imports being overridden by the standard library (Python 2.4)
                            
                                Python subprocess to Bash: curly braces
                            
                                control memory alignment in python ctypes
                            
                                Remove default "Python" submenu with Tkinter Menu on Mac OSX
                            
                                beautiful soup getting href based on a text
                            
                                Does a transaction start even on SELECT?
                            
                                How can i do replace a child element(s) in ElementTree
                            
                                Partially Evaluating Python Classmethod Based on Where It's Accessed From
                            
                                CSS not rendered by Pisa's pdf generation in Django
                            
                                Python Generated Signature for S3 Post
                            
                                python: regex only gets the last occurrence
                            
                                Python watching for process start up?
                            
                                Creating package installer in OS X - install Python, NumPy and other dependencies

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Equivalent of wget in Python to download website and resources

Tags:

python

wget

web-crawler

Conrad

People also ask

2 Answers

ironchefpython

Conrad

Recent Activity

Donate For Us