
The order of Scrapy crawling URLs with a long start_urls list and URLs yielded from the spider

Help! Reading the source code of Scrapy is not easy for me. I have a very long start_urls list: about 3,000,000 URLs stored in a file. So I build start_urls like this:

import codecs

def read_urls_from_file(file_path):
    # Lazily yield one URL per line from a GB18030-encoded file.
    with codecs.open(file_path, u"r", encoding=u"GB18030") as f:
        for line in f:
            try:
                url = line.strip()
                yield url
            except Exception:
                print u"read line:%s from file failed!" % line
                continue
    print u"file read finish!"

start_urls = read_urls_from_file(u"XXXX")

Meanwhile, my spider's callback functions look like this:

    def parse(self, response):
        self.log("Visited %s" % response.url)
        return Request(url="http://www.baidu.com", callback=self.just_test1)

    def just_test1(self, response):
        self.log("Visited %s" % response.url)
        return Request(url="http://www.163.com", callback=self.just_test2)

    def just_test2(self, response):
        self.log("Visited %s" % response.url)
        return []

My questions are:

  1. In what order does the downloader use the URLs? Will the requests made by just_test1 and just_test2 be used by the downloader only after all of the start_urls have been used? (I ran some tests, and the answer seems to be no.)
  2. What decides the order? Why is the order like this, and how can we control it?
  3. Is this a good way to deal with so many URLs that are already in a file? What else could I do?

Thank you very much!!!

Thanks for the answers, but I am still a bit confused. By default, Scrapy uses a LIFO queue for storing pending requests.

  1. The requests made by the spider's callback functions are given to the scheduler. Who does the same thing for the requests built from start_urls? The spider's start_requests() function only generates an iterator without producing the real requests.
  2. Will all the requests (the ones from start_urls and the ones from callbacks) go into the same request queue? How many queues are there in Scrapy?
asked Jun 01 '13 by YuBo Xian

1 Answer

First of all, please see this thread - I think you'll find all the answers there.

In what order does the downloader use the URLs? Will the requests made by just_test1 and just_test2 be used by the downloader only after all of the start_urls have been used? (I ran some tests, and the answer seems to be no.)

You are right, the answer is no. The behavior is completely asynchronous: when the spider starts, its start_requests() method is called (source):

def start_requests(self):
    for url in self.start_urls:
        yield self.make_requests_from_url(url)

def make_requests_from_url(self, url):
    return Request(url, dont_filter=True)
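
Because the downloader keeps several of those requests in flight at once, responses (and the new requests their callbacks yield) come back and get scheduled interleaved with the remaining start requests. As a rough illustration, these are standard Scrapy concurrency settings, shown here with their documented default values:

# settings.py (illustration only): concurrency is why arrival order is
# unpredictable; the values below are the Scrapy defaults.
CONCURRENT_REQUESTS = 16            # up to 16 requests in flight at once
CONCURRENT_REQUESTS_PER_DOMAIN = 8  # per-domain cap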

What decides the order? Why is the order like this, and how can we control it?

By default, there is no pre-defined order - you cannot know when Requests from make_requests_from_url will arrive - it's asynchronous.

See this answer on how you can control the order. Long story short, you can override start_requests and set the priority of the Requests you yield; Request accepts a priority argument, and requests with a higher priority value are scheduled earlier (for example, yield Request(url, priority=0)). The priority can be derived from the line number where the URL was found.
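
Related to the LIFO point in the question's addendum: besides per-request priority, the scheduler's queue classes themselves are configurable through the settings in reasonably recent Scrapy versions, and that is the documented way to switch to a breadth-first (FIFO) crawl instead of the default depth-first one. A rough sketch (setting names as given in the Scrapy FAQ; check the docs of your version):

# settings.py (sketch, assuming a reasonably recent Scrapy version):
# crawl breadth-first (FIFO) instead of the default depth-first (LIFO).
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'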

Is this a good way to deal with so many URLs that are already in a file? What else could I do?

I think you should read your file and yield the URLs directly in the start_requests method: see this answer.

So you should do something like this:

def start_requests(self):
    # Assumes "import codecs" and "from scrapy.http import Request" at the
    # top of the spider module, and self.file_path pointing at the URL file.
    with codecs.open(self.file_path, u"r", encoding=u"GB18030") as f:
        for index, line in enumerate(f):
            url = line.strip()
            if not url:
                continue
            # Higher priority values are scheduled earlier, so negate the
            # line index to keep roughly the original file order.
            yield Request(url, priority=-index)
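
As a side note on question 3, beyond what the linked answers cover: with roughly 3,000,000 URLs it can also be worth keeping the scheduler's pending-request queue on disk so a long crawl can be paused and resumed. A minimal sketch using Scrapy's standard JOBDIR setting; the directory name below is only an example:

# settings.py (sketch): persist the request queue on disk and make the crawl
# resumable; the directory path here is just a placeholder.
JOBDIR = 'crawls/starturls_run-1'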

Hope that helps.

answered Sep 23 '22 by alecxe