For the sake of both learning and performance, given this recursive web-crawling function (which crawls only within the given domain), what would be the best approach to make it run iteratively? Currently, by the time it finishes, Python has climbed to over 1 GB of memory, which isn't acceptable for a shared environment.
def crawl(self, url):
    "Get all URLs from which to scrape categories."
    try:
        links = BeautifulSoup(urllib2.urlopen(url)).findAll(Crawler._match_tag)
    except urllib2.HTTPError:
        return
    for link in links:
        for attr in link.attrs:
            if Crawler._match_attr(attr):
                if Crawler._is_category(attr):
                    pass
                elif attr[1] not in self._crawled:
                    self._crawled.append(attr[1])
                    self.crawl(attr[1])
Use a BFS instead of crawling recursively (which is a DFS): http://en.wikipedia.org/wiki/Breadth_first_search
You can back the BFS queue with external storage (such as a database) to free up RAM.
The algorithm is:
// pseudocode:
var urlsToVisit = new Queue(); // Could be a queue (BFS) or a stack (DFS),
                               // probably with a database backing or something.
var visitedUrls = new Set();   // Set of already-visited URLs.

// initialization:
urlsToVisit.Add( rootUrl );

while (urlsToVisit.Count > 0) {
    var nextUrl = urlsToVisit.FetchAndRemoveNextUrl();
    var page = FetchPage(nextUrl);
    ProcessPage(page);
    visitedUrls.Add(nextUrl);

    var links = ParseLinks(page);
    foreach (var link in links)
        if (!visitedUrls.Contains(link))
            urlsToVisit.Add(link);
}
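Applied to the crawl method from the question, a minimal Python sketch might look like the code below. It assumes the same module context as the original (urllib2, BeautifulSoup, and the Crawler._match_tag / _match_attr / _is_category helpers), marks URLs as seen at enqueue time like the original does, and swaps self._crawled from a list to a set so the membership test stays cheap:

from collections import deque

def crawl(self, root_url):
    "Get all URLs from which to scrape categories, iteratively."
    to_visit = deque([root_url])     # BFS frontier; popleft() gives FIFO order
    self._crawled = set([root_url])  # URLs already enqueued/visited
    while to_visit:
        url = to_visit.popleft()
        try:
            links = BeautifulSoup(urllib2.urlopen(url)).findAll(Crawler._match_tag)
        except urllib2.HTTPError:
            continue  # skip pages that fail and keep draining the queue
        for link in links:
            for attr in link.attrs:
                if Crawler._match_attr(attr) and not Crawler._is_category(attr):
                    if attr[1] not in self._crawled:
                        self._crawled.add(attr[1])
                        to_visit.append(attr[1])

Besides making the frontier explicit, this lets each page's parse tree be garbage-collected as soon as the inner loop finishes, whereas the recursive version keeps every ancestor page's links alive on the call stack, which is likely a large part of that 1 GB.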
Instead of recursing, put the new URLs to crawl into a queue, then keep processing until the queue is empty. If you keep the queue in a file, this uses almost no memory at all.
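One simple way to keep the frontier out of RAM is an SQLite file, which ships with Python as the sqlite3 module. The class below (DiskQueue is just an illustrative name, not anything from the question) is a minimal sketch of a disk-backed FIFO queue with a built-in seen-set; you would push the root URL, then pop/push inside the crawl loop instead of recursing:

import sqlite3

class DiskQueue(object):
    "Minimal FIFO crawl frontier stored on disk in SQLite."

    def __init__(self, path):
        self.db = sqlite3.connect(path)
        self.db.execute("CREATE TABLE IF NOT EXISTS frontier "
                        "(id INTEGER PRIMARY KEY AUTOINCREMENT, url TEXT)")
        self.db.execute("CREATE TABLE IF NOT EXISTS seen (url TEXT PRIMARY KEY)")
        self.db.commit()

    def push(self, url):
        "Enqueue url unless it has been seen before."
        already = self.db.execute(
            "SELECT 1 FROM seen WHERE url = ?", (url,)).fetchone()
        if already is None:
            self.db.execute("INSERT INTO seen (url) VALUES (?)", (url,))
            self.db.execute("INSERT INTO frontier (url) VALUES (?)", (url,))
            self.db.commit()

    def pop(self):
        "Dequeue the oldest URL, or return None when the frontier is empty."
        row = self.db.execute(
            "SELECT id, url FROM frontier ORDER BY id LIMIT 1").fetchone()
        if row is None:
            return None
        self.db.execute("DELETE FROM frontier WHERE id = ?", (row[0],))
        self.db.commit()
        return row[1]

# usage sketch inside the crawl loop:
# queue = DiskQueue("frontier.db")
# queue.push(root_url)
# url = queue.pop()
# while url is not None:
#     ...fetch and parse url, queue.push() each new link...
#     url = queue.pop()

With this, the only thing held in memory at any time is the page currently being processed; the frontier and the seen-set both live in the .db file.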