I'm researching web crawlers written in Python, and I've stumbled across a pretty simple one. But I don't understand the last few lines, highlighted with a comment in the following code:
import sys
import re
import urllib2
import urlparse

tocrawl = [sys.argv[1]]
crawled = []
keywordregex = re.compile(r'<meta\sname=["\']keywords["\']\scontent=["\'](.*?)["\']\s/>')
linkregex = re.compile(r'<a\s(?:.*?\s)*?href=[\'"](.*?)[\'"].*?>')

while 1:
    crawling = tocrawl.pop(0)
    response = urllib2.urlopen(crawling)
    msg = response.read()
    keywordlist = keywordregex.findall(msg)
    crawled.append(crawling)
    links = linkregex.findall(msg)
    url = urlparse.urlparse(crawling)
    a = (links.pop(0) for _ in range(len(links)))  # What does this do?
    for link in a:
        if link.startswith('/'):
            link = 'http://' + url[1] + link
        elif link.startswith('#'):
            link = 'http://' + url[1] + url[2] + link
        elif not link.startswith('http'):
            link = 'http://' + url[1] + '/' + link
        if link not in crawled:
            tocrawl.append(link)
That line looks like some kind of list comprehension to me, but I'm not sure, and I need an explanation.
It's a generator expression, and it simply empties the links list as you iterate over it.
They could have replaced this part:

a = (links.pop(0) for _ in range(len(links)))
for link in a:

with this:

while links:
    link = links.pop(0)

And it would work the same. But since popping from the end of a list is more efficient, this would be better than either:
links.reverse()
while links:
    link = links.pop()
Of course, if you're fine with following the links in reverse order (I don't see why they need to be processed in order), it would be even more efficient to not reverse the links list and just pop off the end.
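Another option, not in the original code, is collections.deque from the standard library: popleft() removes from the front in O(1), so you keep the original order without reversing and without the O(n) cost of list.pop(0). A minimal sketch (the URLs here are made-up placeholders):

```python
from collections import deque

# Hypothetical queue of discovered links.
# deque.popleft() is O(1); list.pop(0) shifts every remaining element, O(n).
links = deque(["http://a.example/", "http://b.example/", "http://c.example/"])

visited = []
while links:
    link = links.popleft()  # O(1), preserves discovery order
    visited.append(link)

print(visited)  # all three URLs, in order; links is now empty
```

This is the usual idiom for a crawl frontier processed first-in, first-out.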
It creates a generator which will take objects off the links list.
To explain:
range(len(links)) returns a list of numbers from 0 up to, but not including, the length of the links list. So if links contains [ "www.yahoo.com", "www.google.com", "www.python.org" ], then it will generate a list [ 0, 1, 2 ].
for _ in blah just loops over that list, throwing away the values (the name _ conventionally signals the value is unused).
links.pop(0) removes the first item from links.
The entire expression returns a generator which pops items from the head of the links list.
And lastly, a demonstration in a Python console:

>>> links = ["www.yahoo.com", "www.google.com", "www.python.org"]
>>> a = (links.pop(0) for _ in range(len(links)))
>>> a.next()
'www.yahoo.com'
>>> links
['www.google.com', 'www.python.org']
>>> a.next()
'www.google.com'
>>> links
['www.python.org']
>>> a.next()
'www.python.org'
>>> links
[]