
How to write a Python script for downloading files?

I want to download some files from this site: http://www.emuparadise.me/soundtracks/highquality/index.php

But I only want to get certain ones.

Is there a way to write a Python script to do this? I have intermediate knowledge of Python.

I'm just looking for a bit of guidance; please point me towards a wiki or library to accomplish this.

Thanks, Shrub

Here's a link to my code

asked Sep 25 '12 by RN_

2 Answers

I looked at the page. The links seem to redirect to another page where the file is hosted, and clicking the link on that page downloads the file.

I would use mechanize to follow the required links to the right page, and then use BeautifulSoup or lxml to parse the resultant page to get the filename.
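
A rough sketch of that flow (Python 2, matching the rest of this answer); the url_regex filter below is only a guess, so adjust it to match the links you actually want:

import mechanize
from bs4 import BeautifulSoup

br = mechanize.Browser()
br.set_handle_robots(False)  # the site may disallow robots; use responsibly
br.open("http://www.emuparadise.me/soundtracks/highquality/index.php")

# Follow each soundtrack link to the page where the file is hosted
for link in br.links(url_regex="soundtrack"):
    page = br.follow_link(link).read()
    soup = BeautifulSoup(page, "html.parser")
    # ...inspect soup here to find the direct download link and filename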

Then it's a simple matter of opening the file using urlopen and writing its contents out into a local file like so:

from urllib import urlopen  # Python 2 (urllib2.urlopen also works)

# Open in binary mode ('wb') so non-text files aren't corrupted
with open(localFilePath, 'wb') as f:
    f.write(urlopen(remoteFilePath).read())

Hope that helps

answered Nov 10 '22 by inspectorG4dget


Make a URL request for the page. Once you have the source, filter it and extract the URLs.

The files you want to download are URLs that contain a specific extension, so you can do a regular expression search for all URLs that match your criteria. After filtering, make a URL request for each matched URL and write its data to a local file.

Sample code:

#!/usr/bin/python
import re
import sys
import urllib

#Your sample url
sampleUrl = "http://stackoverflow.com"
urlAddInfo = urllib.urlopen(sampleUrl)
data = urlAddInfo.read()

#Sample extensions we'll be looking for: pngs and pdfs
TARGET_EXTENSIONS = r"\.(png|pdf)\b" #Require the dot so 'png'/'pdf' elsewhere in a url doesn't match
targetCompile = re.compile(TARGET_EXTENSIONS, re.UNICODE|re.MULTILINE)

#Let's get all the urls; match criteria: no whitespace or '"' in a url
urls = re.findall(r'(https?://[^\s"]+)', data, re.UNICODE|re.MULTILINE)

#We want these folks
extensionMatches = filter(lambda url: url and targetCompile.search(url), urls)

#The rest of the unmatched urls, for which the scraping can also be repeated.
nonExtMatches = filter(lambda url: url and not targetCompile.search(url), urls)


def fileDl(targetUrl):
  #Function to handle downloading of files.
  #Arg: url => a String
  #Output: Boolean to signify if the file has been written to disk

  #Validation of the url is assumed, for the sake of keeping the illustration short
  urlAddInfo = urllib.urlopen(targetUrl)
  data = urlAddInfo.read()
  fileNameSearch = re.search(r"([^\/\s]+)$", targetUrl) #Text after the last slash '/'
  if not fileNameSearch:
    sys.stderr.write("Could not extract a filename from url '%s'\n"%(targetUrl))
    return False
  fileName = fileNameSearch.group(1)
  with open(fileName, "wb") as f:
    f.write(data)
    sys.stderr.write("Wrote %s to disk\n"%(fileName))
  return True

#Let's now download the matched files
dlResults = map(lambda fUrl: fileDl(fUrl), extensionMatches)
successfulDls = filter(lambda s: s, dlResults)
sys.stderr.write("Downloaded %d files from %s\n"%(len(successfulDls), sampleUrl))

#You can organize the above code into a function to repeat the process for each of the
#other urls, and in that way make a crawler (a rough sketch follows below).
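
Here is a rough sketch of that idea. It reuses fileDl and targetCompile from the sample above and keeps the same Python 2 style; a real crawler would also need a visited set so it doesn't fetch the same page twice.

#Rough sketch only: reuses fileDl and targetCompile from the sample above
def crawl(url, depth=1):
  if depth < 0:
    return
  data = urllib.urlopen(url).read()
  urls = re.findall(r'(https?://[^\s"]+)', data, re.UNICODE|re.MULTILINE)
  #Download every url that matches the target extensions
  for fileUrl in filter(lambda u: targetCompile.search(u), urls):
    fileDl(fileUrl)
  #Repeat the process on the remaining page urls
  for nextUrl in filter(lambda u: not targetCompile.search(u), urls):
    crawl(nextUrl, depth - 1)

#crawl(sampleUrl)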

The above code is written mainly for Python 2.X. However, I wrote a crawler that works on any version starting from 2.X.
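
For reference, a minimal sketch of the same download step in Python 3, where urllib.urlopen becomes urllib.request.urlopen (this is just the write-to-disk idea, not the crawler mentioned above):

from urllib.request import urlopen

def file_dl(target_url, filename):
    #Download target_url and write its bytes to filename
    data = urlopen(target_url).read()
    with open(filename, "wb") as f:
        f.write(data)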

answered Nov 09 '22 by Emmanuel Odeke