
What does the following Python code do? It's like a list comprehension with parentheses.

I'm researching web crawlers written in Python, and I've stumbled across a pretty simple one. But I don't understand one of the last few lines, marked with a comment in the following code:

import sys
import re
import urllib2
import urlparse

tocrawl = [sys.argv[1]]
crawled = []

keywordregex = re.compile('<meta\sname=["\']keywords["\']\scontent=["\'](.*?)["\']\s/>')
linkregex = re.compile('<a\s(?:.*?\s)*?href=[\'"](.*?)[\'"].*?>')

while 1:
    crawling = tocrawl.pop(0)
    response = urllib2.urlopen(crawling)
    msg = response.read()
    keywordlist = keywordregex.findall(msg)
    crawled.append(crawling)
    links = linkregex.findall(msg)
    url = urlparse.urlparse(crawling)

    a = (links.pop(0) for _ in range(len(links)))  # What does this do?

    for link in a:
        if link.startswith('/'):
            link = 'http://' + url[1] + link
        elif link.startswith('#'):
            link = 'http://' + url[1] + url[2] + link
        elif not link.startswith('http'):
            link = 'http://' + url[1] + '/' + link

        if link not in crawled:
            tocrawl.append(link)

That line looks like some kind of a list comprehension to me, but I'm not sure and I need an explanation.

corazza asked May 14 '26 09:05


2 Answers

It's a generator expression and it simply empties the list links as you iterate over it.

They could have replaced this part

a = (links.pop(0) for _ in range(len(links)))  # What does this do?

for link in a:

With this:

while links:
    link = links.pop(0)

And it would work the same. But links.pop(0) has to shift every remaining element one position forward on each call, while popping from the end of a list is cheap, so this would be better than either:

links.reverse()
while links:
    link = links.pop()

Of course, if you're fine with following the links in reverse order (I don't see why they need to be processed in order), it would be even more efficient to not reverse the links list and just pop off the end.
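To check that the three variants above really visit the links in the same order, here is a minimal sketch. The function names and the sample list are made up for illustration and are not part of the original crawler; each function works on a copy so the input list is left alone.

```python
def drain_with_generator(links):
    """Mirror the crawler: a generator expression that pops from the front."""
    links = list(links)  # copy, so the caller's list is not emptied
    a = (links.pop(0) for _ in range(len(links)))
    return [link for link in a]


def drain_with_while(links):
    """Plain while loop popping from the front of the list."""
    links = list(links)
    seen = []
    while links:
        seen.append(links.pop(0))
    return seen


def drain_reversed(links):
    """Reverse once, then pop from the end, which avoids shifting elements."""
    links = list(links)
    links.reverse()
    seen = []
    while links:
        seen.append(links.pop())
    return seen


# Hypothetical sample data, only for the comparison.
sample = ["/about", "#top", "contact.html"]
results = [
    drain_with_generator(sample),
    drain_with_while(sample),
    drain_reversed(sample),
]
print(results[0] == results[1] == results[2] == sample)
```

All three return the links in their original order; the difference is only in how much work each pop does.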

Lauritz V. Thaulow answered May 16 '26 00:05


It creates a generator which will take objects off the links list.

To explain:

range(len(links)) returns a list of numbers from 0 up to, but not including, the length of the links list (in Python 2, which this code uses; in Python 3 it is a range object that iterates the same way). So if links contains ["www.yahoo.com", "www.google.com", "www.python.org"], then it will generate the list [0, 1, 2].

for _ in range(len(links)) just loops over that list, throwing away each number. The underscore is a conventional name for a value that is never used.

links.pop(0) removes the first item from links.

The entire expression returns a generator which pops items from the head of the links list.

And lastly, a demonstration in a Python console:

>>> links = ["www.yahoo.com", "www.google.com", "www.python.org"]
>>> a = (links.pop(0) for _ in range(len(links)))
>>> a.next()
'www.yahoo.com'
>>> links
['www.google.com', 'www.python.org']
>>> a.next()
'www.google.com'
>>> links
['www.python.org']
>>> a.next()
'www.python.org'
>>> links
[]
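
The session above uses the Python 2 method a.next(). As a rough sketch of the same demonstration as a script, the built-in next(a) works on both Python 2.6+ and Python 3. The intermediate variable names are only there for illustration. It also shows that nothing is popped until the generator is actually iterated:

```python
links = ["www.yahoo.com", "www.google.com", "www.python.org"]

# range(len(links)) is evaluated once, here, so the generator will yield 3 items.
a = (links.pop(0) for _ in range(len(links)))

# Creating the generator has not popped anything yet.
untouched = list(links)

# Each next() call runs links.pop(0) exactly once.
first = next(a)
after_first = list(links)

# Consuming the rest of the generator empties the list.
rest = list(a)

print(untouched)
print(first)
print(after_first)
print(rest)
print(links)
```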

Martin answered May 15 '26 23:05


