 

How to get all links from website using Beautiful Soup (python) Recursively

I want to be able to recursively get all links from a website, then follow those links and get all links from those websites. The depth should be 5-10, so that it returns an array of all links that it finds. Preferably using Beautiful Soup / Python. Thanks!

I have tried this so far and it is not working... any help will be appreciated.

from BeautifulSoup import BeautifulSoup
import urllib2

def getLinks(url):
    if (len(url)==0):
        return [url]
    else:
        files = [ ]
        page=urllib2.urlopen(url)
        soup=BeautifulSoup(page.read())
        universities=soup.findAll('a',{'class':'institution'})
        for eachuniversity in universities:
           files+=getLinks(eachuniversity['href'])
        return files

print getLinks("http://www.utexas.edu/world/univ/alpha/")
asked Nov 25 '13 by coderlyfe


2 Answers

The number of crawled pages will grow exponentially, and there are many issues involved that might not look complicated at first glance. Check out the Scrapy architecture overview to get a sense of how crawling is done in real life.


Among other great features, Scrapy will not crawl the same page twice (unless you force it to), and it can be configured with a maximum depth via the DEPTH_LIMIT setting.

Even better, Scrapy has built-in link extraction tools: link extractors.

answered Sep 27 '22 by Guy Gavriely


Recursive algorithms are used to reduce big problems to smaller ones that have the same structure, and then combine the results. They are often composed of a base case, which does not lead to recursion, and another case that does. For example, say you were born in 1986 and you want to calculate your age. You could write:

def myAge(currentyear):
    if currentyear == 1986: #Base case, does not lead to recursion.
        return 0
    else:                   #Leads to recursion
        return 1+myAge(currentyear-1)

I, myself, don't really see the point of using recursion in your problem. My first suggestion is that you put a limit in your code. What you gave us will just run indefinitely, because the program gets stuck in infinitely nested for loops; it never reaches an end and starts returning. So you could have a variable outside the function that updates every time you go down a level and, at a certain point, stops the function from starting a new for loop and makes it start returning what it has found.

But then you are getting into changing global variables, you are using recursion in a strange way, and the code gets messy.

Now, reading the comments and seeing what you really want (which, I must say, is not entirely clear), you can use a recursive algorithm in part of your code without writing all of it recursively:

def recursiveUrl(url, depth):
    if depth == 5:
        return url
    else:
        page = urllib2.urlopen(url)
        soup = BeautifulSoup(page.read())
        newlink = soup.find('a')  # find just the first link on the page
        if newlink is None:
            return url
        else:
            # follow the href attribute, not the tag object itself
            return url, recursiveUrl(newlink['href'], depth + 1)


def getLinks(url):
    page = urllib2.urlopen(url)
    soup = BeautifulSoup(page.read())
    links = soup.find_all('a', {'class': 'institution'})
    results = []  # collect into a new list: appending to 'links' while iterating it would never finish
    for link in links:
        results.append(recursiveUrl(link['href'], 0))
    return results

Now there is still a problem with this: links do not always point to web pages, but also to files and images. That is why I wrote the if/else statement in the recursive part of the 'url-opening' function. The other problem is that your first website has 2166 institution links, and creating 2166 * 5 BeautifulSoup objects is not fast. The code above runs a recursive function 2166 times. That by itself shouldn't be a problem, but you are dealing with big HTML (or PHP, whatever) files, so making 2166 * 5 soups takes a huge amount of time.
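Both safeguards, a visited set and an explicit depth limit, are easier to see in a self-contained sketch. Here a plain dictionary stands in for urlopen plus BeautifulSoup (the page names and the link graph are made up for illustration), and the crawl is breadth-first instead of recursive, which also avoids revisiting pages that link back to each other:

```python
from collections import deque

# Hypothetical link graph standing in for real pages: page -> links on it.
PAGES = {
    'a': ['b', 'c'],
    'b': ['a', 'd'],   # 'b' links back to 'a'; the visited set prevents a loop
    'c': ['d'],
    'd': [],
}

def get_links(start, max_depth):
    """Breadth-first crawl: every page reachable within max_depth hops, once each."""
    visited = {start}
    queue = deque([(start, 0)])
    found = []
    while queue:
        page, depth = queue.popleft()
        found.append(page)
        if depth == max_depth:
            continue  # at the limit: record the page but don't follow its links
        for link in PAGES.get(page, []):
            if link not in visited:
                visited.add(link)
                queue.append((link, depth + 1))
    return found

# get_links('a', 2) -> ['a', 'b', 'c', 'd']
```

In a real crawler, the dictionary lookup would be replaced by fetching the page and extracting its anchors, but the visited/depth bookkeeping stays exactly the same.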

answered Sep 27 '22 by JGallo