I'm researching web crawlers written in Python, and I've stumbled across a pretty simple one. But I don't understand the last few lines, highlighted with a comment in the following code:
import sys
import re
import urllib2
import urlparse

tocrawl = [sys.argv[1]]
crawled = []
keywordregex = re.compile(r'<meta\sname=["\']keywords["\']\scontent=["\'](.*?)["\']\s/>')
linkregex = re.compile(r'<a\s(?:.*?\s)*?href=[\'"](.*?)[\'"].*?>')

while 1:
    crawling = tocrawl.pop(0)
    response = urllib2.urlopen(crawling)
    msg = response.read()
    keywordlist = keywordregex.findall(msg)
    crawled.append(crawling)
    links = linkregex.findall(msg)
    url = urlparse.urlparse(crawling)
    a = (links.pop(0) for _ in range(len(links)))  # What does this do?
    for link in a:
        if link.startswith('/'):
            link = 'http://' + url[1] + link
        elif link.startswith('#'):
            link = 'http://' + url[1] + url[2] + link
        elif not link.startswith('http'):
            link = 'http://' + url[1] + '/' + link
        if link not in crawled:
            tocrawl.append(link)
That line looks like some kind of list comprehension to me, but I'm not sure, and I need an explanation.
It's a generator expression, and it simply empties the links list as you iterate over it.
They could have replaced this part:

a = (links.pop(0) for _ in range(len(links)))
for link in a:

with this:

while links:
    link = links.pop(0)

And it would work the same. But since popping from the end of a list is more efficient, this would be better than either:
links.reverse()
while links:
    link = links.pop()
Of course, if you're fine with following the links in reverse order (I don't see why they need to be processed in order), it would be even more efficient to not reverse the links list and just pop off the end.
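Another option, not in the original code, is collections.deque from the standard library: popleft() removes from the front in O(1), so you keep the original order without reversing and without the O(n) cost of list.pop(0). A minimal sketch (the URLs here are made-up placeholders):

```python
from collections import deque

# Hypothetical queue of discovered links.
# deque.popleft() is O(1); list.pop(0) shifts every remaining element, O(n).
links = deque(["http://a.example/", "http://b.example/", "http://c.example/"])

visited = []
while links:
    link = links.popleft()  # O(1), preserves discovery order
    visited.append(link)

print(visited)  # all three URLs, in order; links is now empty
```

This is the usual idiom for a crawl frontier processed first-in, first-out.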
It creates a generator which will take objects off the links list.
To explain:
range(len(links)) returns a list of numbers from 0 up to, but not including, the length of the links list. So if links contains [ "www.yahoo.com", "www.google.com", "www.python.org" ], then it will generate a list [ 0, 1, 2 ].
for _ in blah just loops over that list, throwing away the values (the name _ conventionally signals the value is unused).
links.pop(0) removes the first item from links.
The entire expression returns a generator which pops items from the head of the links list.
And lastly, a demonstration in a Python console:

>>> links = ["www.yahoo.com", "www.google.com", "www.python.org"]
>>> a = (links.pop(0) for _ in range(len(links)))
>>> a.next()
'www.yahoo.com'
>>> links
['www.google.com', 'www.python.org']
>>> a.next()
'www.google.com'
>>> links
['www.python.org']
>>> a.next()
'www.python.org'
>>> links
[]