I am mentioning only SOME of the questions that I have referred before posting this question (I currently don't have links to all of those questions that I had referred to, before posting this question)-:
I am able to run this code completely, if I don't pass the arguments and ask for an input from the user from the BBSpider Class (without the main function - ust below the name="dmoz" line), or provide them as pre-defined (i.e, static) arguments.
My code is here.
I am basically trying to execute a Scrapy spider from a Python Script without the requirement of any additional files (even the Settings File). That is why, I have specified the settings also inside the code itself.
This is the output that I am getting on executing this script-:
http://bigbasket.com/ps/?q=apple
2015-06-26 12:12:34 [scrapy] INFO: Scrapy 1.0.0 started (bot: scrapybot)
2015-06-26 12:12:34 [scrapy] INFO: Optional features available: ssl, http11
2015-06-26 12:12:34 [scrapy] INFO: Overridden settings: {}
2015-06-26 12:12:35 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
None
2015-06-26 12:12:35 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-06-26 12:12:35 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-06-26 12:12:35 [scrapy] INFO: Enabled item pipelines:
2015-06-26 12:12:35 [scrapy] INFO: Spider opened
2015-06-26 12:12:35 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-06-26 12:12:35 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-06-26 12:12:35 [scrapy] ERROR: Error while obtaining start requests
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/scrapy/core/engine.py", line 110, in _next_request
request = next(slot.start_requests)
File "/usr/local/lib/python2.7/dist-packages/scrapy/spiders/__init__.py", line 70, in start_requests
yield self.make_requests_from_url(url)
File "/usr/local/lib/python2.7/dist-packages/scrapy/spiders/__init__.py", line 73, in make_requests_from_url
return Request(url, dont_filter=True)
File "/usr/local/lib/python2.7/dist-packages/scrapy/http/request/__init__.py", line 24, in __init__
self._set_url(url)
File "/usr/local/lib/python2.7/dist-packages/scrapy/http/request/__init__.py", line 57, in _set_url
raise TypeError('Request url must be str or unicode, got %s:' % type(url).__name__)
TypeError: Request url must be str or unicode, got NoneType:
2015-06-26 12:12:35 [scrapy] INFO: Closing spider (finished)
2015-06-26 12:12:35 [scrapy] INFO: Dumping Scrapy stats:
{'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 6, 26, 6, 42, 35, 342543),
'log_count/DEBUG': 1,
'log_count/ERROR': 1,
'log_count/INFO': 7,
'start_time': datetime.datetime(2015, 6, 26, 6, 42, 35, 339158)}
2015-06-26 12:12:35 [scrapy] INFO: Spider closed (finished)
The problems that I am currently facing-:
http://bigbasket.com/ps/?q=apple
2015-06-26 12:28:00 [scrapy] INFO: Scrapy 1.0.0 started (bot: scrapybot)
2015-06-26 12:28:00 [scrapy] INFO: Optional features available: ssl, http11
2015-06-26 12:28:00 [scrapy] INFO: Overridden settings: {}
2015-06-26 12:28:00 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
None
2015-06-26 12:28:01 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-06-26 12:28:01 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-06-26 12:28:01 [scrapy] INFO: Enabled item pipelines:
2015-06-26 12:28:01 [scrapy] INFO: Spider opened
2015-06-26 12:28:01 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-06-26 12:28:01 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-06-26 12:28:01 [scrapy] ERROR: Error while obtaining start requests
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/scrapy/core/engine.py", line 110, in _next_request
request = next(slot.start_requests)
File "/usr/local/lib/python2.7/dist-packages/scrapy/spiders/__init__.py", line 70, in start_requests
yield self.make_requests_from_url(url)
File "/usr/local/lib/python2.7/dist-packages/scrapy/spiders/__init__.py", line 73, in make_requests_from_url
return Request(url, dont_filter=True)
File "/usr/local/lib/python2.7/dist-packages/scrapy/http/request/__init__.py", line 24, in __init__
self._set_url(url)
File "/usr/local/lib/python2.7/dist-packages/scrapy/http/request/__init__.py", line 59, in _set_url
raise ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url: None
2015-06-26 12:28:01 [scrapy] INFO: Closing spider (finished)
2015-06-26 12:28:01 [scrapy] INFO: Dumping Scrapy stats:
{'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 6, 26, 6, 58, 1, 248350),
'log_count/DEBUG': 1,
'log_count/ERROR': 1,
'log_count/INFO': 7,
'start_time': datetime.datetime(2015, 6, 26, 6, 58, 1, 236056)}
2015-06-26 12:28:01 [scrapy] INFO: Spider closed (finished)
Please help me resolve the above 2 errors.
Basic Script The key to running scrapy in a python script is the CrawlerProcess class. This is a class of the Crawler module. It provides the engine to run scrapy within a python script. Within the CrawlerProcess class code, python's twisted framework is imported.
The spider will receive arguments in its constructor. Scrapy puts all the arguments as spider attributes and you can skip the init method completely. Beware use getattr method for getting those attributes so your code does not break. Succinct, robust and flexible!
Using the scrapy tool You can start by running the Scrapy tool with no arguments and it will print some usage help and the available commands: Scrapy X.Y - no active project Usage: scrapy <command> [options] [args] Available commands: crawl Run a spider fetch Fetch a URL using the Scrapy downloader [...]
Creating the Spider Simply drop into a Python shell, import the Spider class, initialize it with your target site, and you're done. Within seconds you have a categorized list of URL's!
You need to pass your parameters on the crawl
method of the CrawlerProcess
, so you need to run it like this:
crawler = CrawlerProcess(Settings())
crawler.crawl(BBSpider, start_url=url)
crawler.start()
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With