 

Download all the files on a website

I need to download all the files under these links, where only the suburb name changes in each link.

Just as a reference: https://www.data.vic.gov.au/data/dataset/2014-town-and-community-profile-for-thornbury-suburb

I want all the files under this search link: https://www.data.vic.gov.au/data/dataset?q=2014+town+and+community+profile

Any possibilities?

Thanks :)

Bharath asked Aug 07 '17 06:08

2 Answers

You can download a file like this:

# urllib2 is Python 2 only; on Python 3 use urllib.request
import urllib.request

response = urllib.request.urlopen('http://www.example.com/file_to_download')
html = response.read()

To get all the links on a page:

from bs4 import BeautifulSoup

import requests

r = requests.get("http://site-to.crawl")
data = r.text
# pass a parser explicitly to avoid BeautifulSoup's "no parser specified" warning
soup = BeautifulSoup(data, "html.parser")

for link in soup.find_all('a'):
    print(link.get('href'))
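The two snippets above can be combined: extract the hrefs from the search page, then fetch each one. As a minimal sketch that runs without third-party packages, here is the link-extraction step using only the standard library's `html.parser` (the search URL from the question is the intended input; the HTML string below is just a toy example):

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect the href attribute of every <a> tag encountered."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return all hrefs found in an HTML document, in document order."""
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links


# Toy input; in practice feed it the body of the data.vic.gov.au search page.
sample = '<a href="/dataset/profile.pdf">PDF</a> <a href="data.csv">CSV</a>'
print(extract_links(sample))
```

Each collected URL can then be saved with `urllib.request.urlretrieve(url, filename)`.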
naren answered Nov 14 '22 19:11


You should first read the HTML, parse it using Beautiful Soup, and then find links matching the file type you want to download. For instance, to download all PDF files, check whether each link ends with the .pdf extension.

There's a good explanation and code available here:

https://medium.com/@dementorwriter/notesdownloader-use-web-scraping-to-download-all-pdfs-with-python-511ea9f55e48
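The extension check described above can be sketched as a small helper: it resolves relative hrefs against the page URL and keeps only those ending in the chosen extension (the base URL and hrefs below are made-up examples):

```python
from urllib.parse import urljoin


def file_links(base_url, hrefs, extension=".pdf"):
    """Resolve hrefs against base_url, keeping only those with the given extension.

    The comparison is case-insensitive, so 'report.PDF' also matches '.pdf'.
    """
    return [
        urljoin(base_url, href)
        for href in hrefs
        if href.lower().endswith(extension)
    ]


# Hypothetical page and links, purely for illustration.
links = ["doc.pdf", "image.png", "/files/report.PDF"]
print(file_links("https://example.com/page/", links))
```

`urljoin` handles both relative paths like `doc.pdf` and root-relative ones like `/files/report.PDF`, so the caller gets absolute URLs ready to download.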

x89 answered Oct 15 '22 21:10