Scrapy - How to crawl new pages based on links in scraped items

Tags: python, scrapy

I am new to Scrapy and I am trying to crawl new pages based on the links in scraped items. Specifically, I want to scrape Dropbox file-sharing links from Google search results and store these links in a JSON file. After getting these links, I want to open a new page for each one to verify whether the link is valid. If it is valid, I want to store the file name in the JSON file as well.

I use a DropboxItem with the attributes 'link', 'filename', 'status', and 'err_msg' to store each scraped item, and I try to initiate an asynchronous request for each scraped link in the parse function. But it seems that the parse_file_page function is never called. Does anyone know how to implement such two-step crawling?

    # Imports (Scrapy 0.22-style; DropboxItem is assumed to be defined in the
    # project's items module, e.g. tutorial/items.py):
    from scrapy.spider import Spider
    from scrapy.selector import Selector
    from scrapy.http import Request
    from tutorial.items import DropboxItem

    class DropboxSpider(Spider):
        name = "dropbox"
        allowed_domains = ["google.com"]
        start_urls = [
            "https://www.google.com/#filter=0&q=site:www.dropbox.com/s/&start=0"
        ]

        def parse(self, response):
            sel = Selector(response)
            sites = sel.xpath("//h3[@class='r']")
            items = []
            for site in sites:
                item = DropboxItem()
                link = site.xpath('a/@href').extract()
                item['link'] = link
                link = ''.join(link)
                #I want to parse a new page with url=link here
                new_request = Request(link, callback=self.parse_file_page)
                new_request.meta['item'] = item
                items.append(item)
            return items

        def parse_file_page(self, response):
            #item passed from request
            item = response.meta['item']
            #selector
            sel = Selector(response)
            content_area = sel.xpath("//div[@id='shmodel-content-area']")
            filename_area = content_area.xpath("div[@class='filename shmodel-filename']")
            if filename_area:
                filename = filename_area.xpath("span[@id]/text()").extract()
                if filename:
                    item['filename'] = filename             
                    item['status'] = "normal"
            else:
                err_area = content_area.xpath("div[@class='err']")
                if err_area:
                    err_msg = err_area.xpath("h3/text()").extract()
                    item['err_msg'] = err_msg
                    item['status'] = "error"
            return item

Thanks for @ScrapyNovice's answer. I have modified the code, and now it looks like this:

def parse(self, response):
    sel = Selector(response)
    sites = sel.xpath("//h3[@class='r']")
    #items = []
    for site in sites:
        item = DropboxItem()
        link = site.xpath('a/@href').extract()
        item['link'] = link
        link = ''.join(link)
        print 'link!!!!!!=', link
        new_request = Request(link, callback=self.parse_file_page)
        new_request.meta['item'] = item
        yield new_request
        #items.append(item)
    yield item
    return
    #return item   #Note, when I simply return item here, got an error msg "SyntaxError: 'return' with argument inside generator"

def parse_file_page(self, response):
    #item passed from request
    print 'parse_file_page!!!'
    item = response.meta['item']
    #selector
    sel = Selector(response)
    content_area = sel.xpath("//div[@id='shmodel-content-area']")
    filename_area = content_area.xpath("div[@class='filename shmodel-filename']")
    if filename_area:
        filename = filename_area.xpath("span[@id]/text()").extract()
        if filename:
            item['filename'] = filename
            item['status'] = "normal"
            item['err_msg'] = "none"
            print 'filename=', filename
    else:
        err_area = content_area.xpath("div[@class='err']")
        if err_area:
            err_msg = err_area.xpath("h3/text()").extract()
            item['filename'] = "null"
            item['err_msg'] = err_msg
            item['status'] = "error"
            print 'err_msg', err_msg
        else:
            item['filename'] = "null"
            item['err_msg'] = "unknown_err"
            item['status'] = "error"
            print 'unknown err'
    return item

The control flow actually becomes quite strange. When I use "scrapy crawl dropbox -o items_dropbox.json -t json" to crawl a local file (a downloaded page of Google search results), I see output like this:

2014-05-31 08:40:35-0400 [scrapy] INFO: Scrapy 0.22.2 started (bot: tutorial)
2014-05-31 08:40:35-0400 [scrapy] INFO: Optional features available: ssl, http11
2014-05-31 08:40:35-0400 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'tutorial.spiders', 'FEED_FORMAT': 'json', 'SPIDER_MODULES': ['tutorial.spiders'], 'FEED_URI': 'items_dropbox.json', 'BOT_NAME': 'tutorial'}
2014-05-31 08:40:35-0400 [scrapy] INFO: Enabled extensions: FeedExporter, LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-05-31 08:40:35-0400 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-05-31 08:40:35-0400 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-05-31 08:40:35-0400 [scrapy] INFO: Enabled item pipelines: 
2014-05-31 08:40:35-0400 [dropbox] INFO: Spider opened
2014-05-31 08:40:35-0400 [dropbox] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-05-31 08:40:35-0400 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2014-05-31 08:40:35-0400 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2014-05-31 08:40:35-0400 [dropbox] DEBUG: Crawled (200) <GET file:///home/xin/Downloads/dropbox_s/dropbox_s_1-Google.html> (referer: None)
link!!!!!!= http://www.google.com/intl/en/webmasters/#utm_source=en-wmxmsg&utm_medium=wmxmsg&utm_campaign=bm&authuser=0
link!!!!!!= https://www.dropbox.com/s/
2014-05-31 08:40:35-0400 [dropbox] DEBUG: Filtered offsite request to 'www.dropbox.com': <GET https://www.dropbox.com/s/>
link!!!!!!= https://www.dropbox.com/s/awg9oeyychug66w
link!!!!!!= http://www.dropbox.com/s/kfmoyq9y4vrz8fm
link!!!!!!= https://www.dropbox.com/s/pvsp4uz6gejjhel
....  many links here
link!!!!!!= https://www.dropbox.com/s/gavgg48733m3918/MailCheck.xlsx
link!!!!!!= http://www.dropbox.com/s/9x8924gtb52ksn6/Phonesky.apk
2014-05-31 08:40:35-0400 [dropbox] DEBUG: Scraped from <200 file:///home/xin/Downloads/dropbox_s/dropbox_s_1-Google.html>
    {'link': [u'http://www.dropbox.com/s/9x8924gtb52ksn6/Phonesky.apk']}
2014-05-31 08:40:35-0400 [dropbox] DEBUG: Crawled (200) <GET http://www.google.com/intl/en/webmasters/#utm_source=en-wmxmsg&utm_medium=wmxmsg&utm_campaign=bm&authuser=0> (referer: file:///home/xin/Downloads/dropbox_s/dropbox_s_1-Google.html)
parse_file_page!!!
unknown err
2014-05-31 08:40:35-0400 [dropbox] DEBUG: Scraped from <200 http://www.google.com/intl/en/webmasters/>
    {'err_msg': 'unknown_err',
     'filename': 'null',
     'link': [u'http://www.google.com/intl/en/webmasters/#utm_source=en-wmxmsg&utm_medium=wmxmsg&utm_campaign=bm&authuser=0'],
     'status': 'error'}
2014-05-31 08:40:35-0400 [dropbox] INFO: Closing spider (finished)
2014-05-31 08:40:35-0400 [dropbox] INFO: Stored json feed (2 items) in: items_dropbox.json
2014-05-31 08:40:35-0400 [dropbox] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 558,
     'downloader/request_count': 2,
     'downloader/request_method_count/GET': 2,
     'downloader/response_bytes': 449979,
     'downloader/response_count': 2,
     'downloader/response_status_count/200': 2,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2014, 5, 31, 12, 40, 35, 348058),
     'item_scraped_count': 2,
     'log_count/DEBUG': 7,
     'log_count/INFO': 8,
     'request_depth_max': 1,
     'response_received_count': 2,
     'scheduler/dequeued': 2,
     'scheduler/dequeued/memory': 2,
     'scheduler/enqueued': 2,
     'scheduler/enqueued/memory': 2,
     'start_time': datetime.datetime(2014, 5, 31, 12, 40, 35, 249309)}
2014-05-31 08:40:35-0400 [dropbox] INFO: Spider closed (finished)

Now the JSON file only contains:

[{"link": ["http://www.dropbox.com/s/9x8924gtb52ksn6/Phonesky.apk"]},
{"status": "error", "err_msg": "unknown_err", "link": ["http://www.google.com/intl/en/webmasters/#utm_source=en-wmxmsg&utm_medium=wmxmsg&utm_campaign=bm&authuser=0"], "filename": "null"}]
asked May 27 '14 by xin_cucs


1 Answer

You're creating a Request and setting the callback nicely, but you never do anything with it.

        for site in sites:
            item = DropboxItem()
            link = site.xpath('a/@href').extract()
            item['link'] = link
            link = ''.join(link)
            #I want to parse a new page with url=link here
            new_request = Request(link, callback=self.parse_file_page)
            new_request.meta['item'] = item
            yield new_request
            # Don't do this here because you're adding your Item twice.
            #items.append(item)

On more of a design level, you're storing all your scraped Items in the items list at the end of parse(), but pipelines usually expect to receive individual Items, not arrays of them. Get rid of the items array and you'll be able to use the JSON Feed Export built into Scrapy to store the results in JSON format.

Update:

The reason you get an error message when you try to return an item is that using yield in a function turns it into a generator. This allows you to call the function repeatedly: each time it gets to a yield, it returns the value you're yielding, but it remembers its state and which line it was executing. The next time you call the generator, it resumes executing from where it left off. When it runs out of things to yield, it raises a StopIteration exception. In Python 2, you're not allowed to return a value from inside a generator (a bare return is fine, but return item produces the SyntaxError you saw).
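
For example, a minimal generator (plain Python 2, nothing Scrapy-specific) behaves like this:

    def count_up_to(n):
        for i in range(1, n + 1):
            yield i        # execution pauses here and resumes on the next call

    gen = count_up_to(3)
    print gen.next()       # 1
    print gen.next()       # 2
    print gen.next()       # 3
    gen.next()             # no more values left: raises StopIteration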

You don't want to yield any items from parse(), because they're still missing their filename, status, etc. at that point.

The requests you're yielding in parse() are for dropbox.com, correct? Those requests aren't going through because dropbox.com is not in the spider's allowed_domains (hence the log message: DEBUG: Filtered offsite request to 'www.dropbox.com': <GET https://www.dropbox.com/s/>).
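
If you do want those requests to go through, one straightforward fix (a sketch, keeping the rest of your spider unchanged) is to add dropbox.com to the spider's allowed_domains:

    class DropboxSpider(Spider):
        name = "dropbox"
        # Include dropbox.com so OffsiteMiddleware no longer filters the
        # share-link requests (subdomains like www.dropbox.com are covered too).
        allowed_domains = ["google.com", "dropbox.com"]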

The one Request that actually works and isn't filtered leads to http://www.google.com/intl/en/webmasters/#utm_source=en-wmxmsg&utm_medium=wmxmsg&utm_campaign=bm&authuser=0, which is one of Google's pages, not Dropbox's. You probably want to use urlparse to check the domain of the link before you make a Request for it in your parse() method.
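
A rough sketch of that check, reusing your parse() (and assuming the same imports, DropboxItem, and spider attributes as in your existing code):

    from urlparse import urlparse  # urllib.parse in Python 3

    class DropboxSpider(Spider):
        # ... name, allowed_domains, start_urls as before ...

        def parse(self, response):
            sel = Selector(response)
            for site in sel.xpath("//h3[@class='r']"):
                link = ''.join(site.xpath('a/@href').extract())
                host = urlparse(link).netloc
                # Only follow links that actually point at dropbox.com.
                if host == 'dropbox.com' or host.endswith('.dropbox.com'):
                    item = DropboxItem()
                    item['link'] = link
                    new_request = Request(link, callback=self.parse_file_page)
                    new_request.meta['item'] = item
                    yield new_request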

As for your results: The first JSON object

{"link": ["http://www.dropbox.com/s/9x8924gtb52ksn6/Phonesky.apk"]}

is from where you call yield item in your parse() method. There's only one because that yield isn't inside a loop of any kind, so when the generator resumes executing, it runs the next line: return, which exits the generator. You'll notice this item is missing all of the fields that get filled in by the parse_file_page() method. This is why you don't want to yield any items in your parse() method.

Your second JSON object

{
 "status": "error", 
 "err_msg": "unknown_err", 
 "link": ["http://www.google.com/intl/en/webmasters/#utm_source=en-wmxmsg&utm_medium=wmxmsg&utm_campaign=bm&authuser=0"], 
 "filename": "null"
}

is the result of trying to parse one of Google's pages as if it were the Dropbox page you had been expecting. You're yielding multiple Requests in your parse() method, and all but one of them point at dropbox.com. All of the Dropbox links are being dropped because they're not in your allowed_domains, so the only Response you get is for the one other link on the page that matches your XPath selector AND comes from one of the sites in your allowed_domains (this is the Google Webmasters link). That's why you only see parse_file_page!!! once in your output.

I recommend learning more about generators, as they are a fundamental part of using Scrapy. The second Google result for "python generator tutorial" looks like a very good place to start.

answered Oct 06 '22 by ScrapyNovice