I want to be able to recursively get all links from a website, then follow those links and get all links from those pages as well. The depth should be 5-10, so that it returns an array of all the links it finds. Preferably using Beautiful Soup / Python. Thanks!
I have tried this so far and it is not working... any help would be appreciated.
from BeautifulSoup import BeautifulSoup
import urllib2

def getLinks(url):
    if (len(url) == 0):
        return [url]
    else:
        files = []
        page = urllib2.urlopen(url)
        soup = BeautifulSoup(page.read())
        universities = soup.findAll('a', {'class': 'institution'})
        for eachuniversity in universities:
            files += getLinks(eachuniversity['href'])
        return files

print getLinks("http://www.utexas.edu/world/univ/alpha/")
The number of crawled pages will grow exponentially, and there are many issues involved that might not look complicated at first glance. Check out the scrapy architecture overview to get a sense of how it should be done in real life.
Among other great features, scrapy will not crawl the same page twice (unless you force it to) and can be configured with a maximum DEPTH_LIMIT.
Even better, scrapy has built-in link extraction tools: link-extractors.
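A minimal sketch of what that might look like (the spider name, callback name and start URL are illustrative, not from the original post):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class UniversitySpider(CrawlSpider):
    name = 'universities'
    start_urls = ['http://www.utexas.edu/world/univ/alpha/']
    custom_settings = {'DEPTH_LIMIT': 5}  # stop following links past depth 5

    # LinkExtractor pulls the links out of each response; follow=True recurses
    # into them, and scrapy's built-in dupefilter skips pages already visited.
    rules = (Rule(LinkExtractor(), callback='parse_item', follow=True),)

    def parse_item(self, response):
        yield {'url': response.url}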
Recursive algorithms are used to reduce big problems to smaller ones that have the same structure, and then combine the results. They usually consist of a base case, which does not lead to recursion, and another case that does. For example, say you were born in 1986 and you want to calculate your age. You could write:
def myAge(currentyear):
    if currentyear == 1986:  # base case, does not lead to recursion
        return 0
    else:                    # leads to recursion
        return 1 + myAge(currentyear - 1)
I, myself, don't really see the point of using recursion for your problem. My first suggestion is that you put a limit in your code. What you gave us will just run infinitely, because the program gets stuck in infinitely nested for loops; it never reaches an end and never starts returning. So you could keep a variable outside the function that is updated every time you go down a level and that, at a certain point, stops the function from starting a new for loop and makes it start returning what it has found.
But then you are getting into changing global variables, you are using recursion in a strange way, and the code gets messy.
Now, reading the comments and seeing what you really want (which, I must say, is not entirely clear), you can use a recursive algorithm for part of your code, but you don't have to write all of it recursively.
import urllib2
from BeautifulSoup import BeautifulSoup

def recursiveUrl(url, depth):
    if depth == 5:
        return url
    else:
        page = urllib2.urlopen(url)
        soup = BeautifulSoup(page.read())
        newlink = soup.find('a')  # find just the first link on the page
        if newlink is None:       # no link found, stop recursing
            return url
        else:
            return url, recursiveUrl(newlink['href'], depth + 1)
def getLinks(url):
    page = urllib2.urlopen(url)
    soup = BeautifulSoup(page.read())
    links = soup.findAll('a', {'class': 'institution'})
    results = []
    for link in links:
        results.append(recursiveUrl(link['href'], 0))
    return results
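With these two functions in place, the call from the question should work essentially unchanged, for example:

print getLinks("http://www.utexas.edu/world/univ/alpha/")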
Now there is still a problem with this: links do not always point to web pages, but also to files and images. That's why I wrote the if/else check in the recursive 'url-opening' function. The other problem is that your first website has 2166 institution links, and creating 2166 * 5 BeautifulSoup objects is not fast. The code above runs the recursive function 2166 times. That by itself shouldn't be a problem, but you are dealing with big HTML (or PHP, or whatever) files, so making 2166 * 5 soups takes a huge amount of time.
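One way to tackle both problems (a sketch under stated assumptions, not part of the original answer; the crawl function, the visited set and the Content-Type check are mine): remember which URLs have already been visited so no page is souped twice, and skip responses that are not HTML before parsing them.

import urllib2
import urlparse
from BeautifulSoup import BeautifulSoup

def crawl(url, depth, visited=None):
    if visited is None:
        visited = set()
    if depth == 0 or url in visited:
        return []
    visited.add(url)
    try:
        page = urllib2.urlopen(url)
    except (urllib2.URLError, ValueError):
        return []  # broken or non-http link (mailto:, javascript:, ...)
    if 'text/html' not in page.info().gettype():
        return [url]  # an image, PDF, etc. -- record it but don't parse it
    soup = BeautifulSoup(page.read())
    found = [url]
    for a in soup.findAll('a', href=True):
        # resolve relative links against the current page before recursing
        found += crawl(urlparse.urljoin(url, a['href']), depth - 1, visited)
    return found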