 

Downloading a web page and all of its resource files in Python

I want to be able to download a page and all of its associated resources (images, style sheets, script files, etc.) using Python. I am (somewhat) familiar with urllib2 and know how to download individual URLs, but before I start hacking at BeautifulSoup + urllib2, I wanted to be sure there isn't already a Python equivalent of "wget --page-requisites http://www.google.com".

Specifically I am interested in gathering statistical information about how long it takes to download an entire web page, including all resources.
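To make the goal concrete, the following is roughly what I'd otherwise be writing by hand: a rough, untested sketch using requests and BeautifulSoup instead of urllib2 (note it misses resources referenced from inside CSS files):

    import time
    from urllib.parse import urljoin

    import requests
    from bs4 import BeautifulSoup

    def timed_page_download(url):
        """Download a page plus its requisites, returning total seconds."""
        start = time.monotonic()
        response = requests.get(url)
        soup = BeautifulSoup(response.text, "html.parser")

        # Collect resource URLs: images, scripts, and anything pulled in
        # via <link> (stylesheets, icons, ...).
        resources = set()
        for img in soup.find_all("img", src=True):
            resources.add(urljoin(url, img["src"]))
        for link in soup.find_all("link", href=True):
            resources.add(urljoin(url, link["href"]))
        for script in soup.find_all("script", src=True):
            resources.add(urljoin(url, script["src"]))

        for resource in resources:
            requests.get(resource)  # body is discarded; only the time matters

        return time.monotonic() - start

    print(timed_page_download("http://www.google.com"))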

Thanks, Mark

asked May 09 '09 by Mark Ransom


People also ask

Can I use Python to download files from a website?

Requests is a versatile HTTP library in Python with many applications. One of them is downloading a file from the web, given the file's URL.
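For example, a minimal sketch (the URL and filename here are placeholders):

    import requests

    url = "http://www.example.com/file.zip"  # placeholder URL
    response = requests.get(url)
    response.raise_for_status()  # fail loudly on HTTP errors

    # Write the downloaded bytes to a local file.
    with open("file.zip", "wb") as f:
        f.write(response.content)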

How do I download multiple files concurrently in Python?

To download multiple files in parallel with Python, start by creating a function (download_parallel) to handle the parallel download. The function takes one argument: an iterable containing URLs and their associated filenames.
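A minimal sketch of that shape, using a thread pool (the snippet above doesn't specify the mechanism, so the thread pool and the (url, filename) pair format are assumptions):

    from concurrent.futures import ThreadPoolExecutor

    import requests

    def download_one(url, filename):
        """Fetch a single URL and write it to disk."""
        response = requests.get(url)
        response.raise_for_status()
        with open(filename, "wb") as f:
            f.write(response.content)

    def download_parallel(inputs):
        """inputs is an iterable of (url, filename) pairs."""
        with ThreadPoolExecutor(max_workers=8) as pool:
            futures = [pool.submit(download_one, url, name)
                       for url, name in inputs]
            for future in futures:
                future.result()  # re-raise any download errors

    inputs = [
        ("http://www.example.com/a.jpg", "a.jpg"),  # placeholder pairs
        ("http://www.example.com/b.jpg", "b.jpg"),
    ]
    download_parallel(inputs)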


2 Answers

Websucker? See http://effbot.org/zone/websucker.htm

answered by RichieHindle


websucker.py doesn't follow CSS links. HTTrack is not Python (it's C/C++), but it's a good, maintained utility for downloading a website for offline browsing.

http://www.mail-archive.com/[email protected]/msg13523.html [issue1124] Webchecker not parsing css "@import url"

Guido> This is essentially unsupported and unmaintained example code. Feel free to submit a patch though!
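For anyone tempted by that patch, the missing piece is roughly pulling @import URLs out of each downloaded stylesheet. A crude illustration (not the actual webchecker code):

    import re

    # Matches @import url("style.css"), @import url(style.css),
    # and @import "style.css" forms.
    IMPORT_RE = re.compile(
        r"""@import\s+(?:url\(\s*)?["']?([^"'()\s;]+)["']?""",
        re.IGNORECASE,
    )

    def css_imports(css_text):
        """Return the URLs referenced by @import rules in a stylesheet."""
        return IMPORT_RE.findall(css_text)

    print(css_imports('@import url("base.css"); @import "print.css";'))
    # ['base.css', 'print.css']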

answered by jamshid