
Pause and resuming job is not working in scrapy project

Tags:

python

scrapy

I am working on a Scrapy project to download images from a site that requires authentication. Everything works fine and I am able to download images. What I need is to be able to pause and resume the spider to crawl images whenever needed. So I followed what the Scrapy manual says about this. To run the spider, I used the command below:

scrapy crawl somespider -s JOBDIR=crawls/somespider-1

To abort the engine, I pressed CTRL+C. To resume, I ran the same command again.

But after resuming, the spider closes within a few minutes; it doesn't resume from where it left off.

Updated:

from scrapy.spider import Spider
from scrapy.http import FormRequest, Request


class SampleSpider(Spider):
    name = "sample project"
    allowed_domains = ["xyz.com"]
    start_urls = (
        'http://abcyz.com/',
    )

    def parse(self, response):
        # Log in via the site's login form before crawling
        return FormRequest.from_response(response,
                                         formname='Loginform',
                                         formdata={'username': 'Name',
                                                   'password': '****'},
                                         callback=self.after_login)

    def after_login(self, response):
        # Check that the login succeeded before going on
        if "authentication error" in str(response.body).lower():
            print "I am error"
            return
        else:
            start_urls = ['..', '..']
            for url in start_urls:
                yield Request(url=url, callback=self.parse_photos, dont_filter=True)

    def parse_photos(self, response):
        # **downloading image here**
        pass

What am I doing wrong?

This is the log I get when I run the spider after pausing:

2014-05-13 15:40:31+0530 [scrapy] INFO: Scrapy 0.22.0 started (bot: sampleproject)
2014-05-13 15:40:31+0530 [scrapy] INFO: Optional features available: ssl, http11, boto, django
2014-05-13 15:40:31+0530 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'sampleproject.spiders', 'SPIDER_MODULES': ['sampleproject.spiders'], 'BOT_NAME': 'sampleproject'}
2014-05-13 15:40:31+0530 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-05-13 15:40:31+0530 [scrapy] INFO: Enabled downloader middlewares: RedirectMiddleware, HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-05-13 15:40:31+0530 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-05-13 15:40:31+0530 [scrapy] INFO: Enabled item pipelines: ImagesPipeline
2014-05-13 15:40:31+0530 [sample] INFO: Spider opened
2014-05-13 15:40:31+0530 [sample] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-05-13 15:40:31+0530 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2014-05-13 15:40:31+0530 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080

......................

2014-05-13 15:42:06+0530 [sample] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 141184,
     'downloader/request_count': 413,
     'downloader/request_method_count/GET': 412,
     'downloader/request_method_count/POST': 1,
     'downloader/response_bytes': 11213203,
     'downloader/response_count': 413,
     'downloader/response_status_count/200': 412,
     'downloader/response_status_count/404': 1,
     'file_count': 285,
     'file_status_count/downloaded': 285,
     'finish_reason': 'shutdown',
     'finish_time': datetime.datetime(2014, 5, 13, 10, 12, 6, 534088),
     'item_scraped_count': 125,
     'log_count/DEBUG': 826,
     'log_count/ERROR': 1,
     'log_count/INFO': 9,
     'log_count/WARNING': 219,
     'request_depth_max': 12,
     'response_received_count': 413,
     'scheduler/dequeued': 127,
     'scheduler/dequeued/disk': 127,
     'scheduler/enqueued': 403,
     'scheduler/enqueued/disk': 403,
     'start_time': datetime.datetime(2014, 5, 13, 10, 10, 31, 232618)}
2014-05-13 15:42:06+0530 [sample] INFO: Spider closed (shutdown)

After resuming, it stops and displays:

INFO: Scrapy 0.22.0 started (bot: sampleproject)
2014-05-13 15:42:32+0530 [scrapy] INFO: Optional features available: ssl, http11, boto, django
2014-05-13 15:42:32+0530 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'sampleproject.spiders', 'SPIDER_MODULES': ['sampleproject.spiders'], 'BOT_NAME': 'sampleproject'}
2014-05-13 15:42:32+0530 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-05-13 15:42:32+0530 [scrapy] INFO: Enabled downloader middlewares: RedirectMiddleware, HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-05-13 15:42:32+0530 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-05-13 15:42:32+0530 [scrapy] INFO: Enabled item pipelines: ImagesPipeline
2014-05-13 15:42:32+0530 [sample] INFO: Spider opened
*2014-05-13 15:42:32+0530 [sample] INFO: Resuming crawl (276 requests scheduled)*
2014-05-13 15:42:32+0530 [sample] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-05-13 15:42:32+0530 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2014-05-13 15:42:32+0530 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080


2014-05-13 15:43:19+0530 [sample] INFO: Closing spider (finished)
2014-05-13 15:43:19+0530 [sample] INFO: Dumping Scrapy stats:
    {'downloader/exception_count': 3,
     'downloader/exception_type_count/twisted.internet.error.DNSLookupError': 3,
     'downloader/request_bytes': 132365,
     'downloader/request_count': 281,
     'downloader/request_method_count/GET': 281,
     'downloader/response_bytes': 567884,
     'downloader/response_count': 278,
     'downloader/response_status_count/200': 278,
     'file_count': 1,
     'file_status_count/downloaded': 1,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2014, 5, 13, 10, 13, 19, 554981),
     'item_scraped_count': 276,
     'log_count/DEBUG': 561,
     'log_count/ERROR': 1,
     'log_count/INFO': 8,
     'log_count/WARNING': 1,
     'request_depth_max': 1,
     'response_received_count': 278,
     'scheduler/dequeued': 277,
     'scheduler/dequeued/disk': 277,
     'scheduler/enqueued': 1,
     'scheduler/enqueued/disk': 1,
     'start_time': datetime.datetime(2014, 5, 13, 10, 12, 32, 659276)}
2014-05-13 15:43:19+0530 [sample] INFO: Spider closed (finished)
People also ask

How do you stop a Scrapy spider?

In the latest version of Scrapy, available on GitHub, you can raise a CloseSpider exception to manually close a spider. It forces the spider to stop, but not immediately; some requests may still be running while it shuts down.
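As a rough sketch (the spider name, URL, and error check below are only placeholders), a callback can raise CloseSpider when some condition is met:

import scrapy
from scrapy.exceptions import CloseSpider


class StopExampleSpider(scrapy.Spider):
    name = "stop_example"
    start_urls = ["http://example.com/"]  # placeholder URL

    def parse(self, response):
        # Abort the whole crawl when some condition is hit;
        # requests already in flight may still finish before shutdown.
        if b"authentication error" in response.body.lower():
            raise CloseSpider("login failed")
        yield {"url": response.url}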

How do you activate the Scrapy project?

You can start by running the Scrapy tool with no arguments and it will print some usage help and the available commands:

Scrapy X.Y - no active project

Usage:
  scrapy <command> [options] [args]

Available commands:
  crawl    Run a spider
  fetch    Fetch a URL using the Scrapy downloader
  [...]

Is Scrapy asynchronous?

Scrapy is asynchronous by default. Using coroutine syntax, introduced in Scrapy 2.0, simply allows for a simpler syntax when using Twisted Deferreds, which are not needed in most use cases, as Scrapy makes its usage transparent whenever possible.
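As a small sketch (spider name and URL are placeholders), a callback written with the coroutine syntax looks like this:

import scrapy


class AsyncExampleSpider(scrapy.Spider):
    name = "async_example"
    start_urls = ["http://example.com/"]  # placeholder URL

    async def parse(self, response):
        # Defining the callback with async def lets it await other
        # awaitables; Scrapy runs it on top of Twisted behind the scenes.
        return {"title": response.css("title::text").get()}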

What is download delay in Scrapy?

The amount of time (in secs) that the downloader should wait before downloading consecutive pages from the same website. This can be used to throttle the crawling speed to avoid hitting servers too hard. DOWNLOAD_DELAY = 0.25 # 250 ms of delay.
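For example, in a project's settings.py (the values here are only illustrative):

# settings.py
# Wait 250 ms between consecutive requests to the same website
DOWNLOAD_DELAY = 0.25

# Scrapy also applies a random factor (0.5x to 1.5x) to this delay
# when RANDOMIZE_DOWNLOAD_DELAY is enabled (the default)
RANDOMIZE_DOWNLOAD_DELAY = True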

Is it possible to pause a crawl and resume it later?

Sometimes, for big sites, it's desirable to pause crawls and be able to resume them later. Scrapy supports this functionality out of the box by providing the following facilities: a scheduler that persists scheduled requests on disk, a duplicates filter that persists visited requests on disk, and an extension that keeps some spider state persistent between batches.

How do I set up a Scrapy project?

Before you start scraping, you will have to set up a new Scrapy project. Enter a directory where you'd like to store your code and run:

scrapy startproject tutorial

This will create a tutorial directory with the project files. Spiders are classes that you define and that Scrapy uses to scrape information from a website (or a group of websites).

Can I use the Scrapy persistence support?

There are a few things to keep in mind if you want to be able to use the Scrapy persistence support: Cookies may expire. So, if you don’t resume your spider quickly the requests scheduled may no longer work. This won’t be an issue if your spider doesn’t rely on cookies.
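As a sketch of the related state-keeping facility (spider name and URL are placeholders), a spider can stash values in self.state; when a JOBDIR is set, the SpiderState extension saves that dict on close and restores it on resume:

import scrapy


class StatefulSpider(scrapy.Spider):
    name = "stateful_example"
    start_urls = ["http://example.com/"]  # placeholder URL

    def parse(self, response):
        # self.state is a plain dict persisted to the JOBDIR between runs
        self.state["pages_seen"] = self.state.get("pages_seen", 0) + 1
        yield {"url": response.url, "pages_seen": self.state["pages_seen"]}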

Why is Scrapy calling the parse method?

The parse() method will be called to handle each of the requests for those URLs, even though we haven't explicitly told Scrapy to do so. This happens because parse() is Scrapy's default callback method, which is called for requests without an explicitly assigned callback.
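A minimal sketch (spider name and URL are placeholders): no callback is specified anywhere, so Scrapy routes the responses for start_urls to parse() automatically:

import scrapy


class DefaultCallbackSpider(scrapy.Spider):
    name = "default_callback_example"
    start_urls = ["http://example.com/"]  # placeholder URL

    def parse(self, response):
        # Called for every start URL because parse() is the default callback
        yield {"title": response.css("title::text").get()}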


1 Answer

Instead of the command you wrote, you can run this one:

scrapy crawl somespider --set JOBDIR=crawl1

And to stop it, press Ctrl+C once and wait for Scrapy to finish shutting down. If you press Ctrl+C twice, it won't work properly!

Then, to resume the crawl, run the same command again:

scrapy crawl somespider --set JOBDIR=crawl1