Passing Argument to Scrapy Spider from Python Script

Tags:

python

arguments

web-scraping

scrapy

scrapy-spider

Here are only some of the questions I referred to before posting this one (I no longer have links to all of them):

Question 1
Question 2

The code runs to completion if I don't pass any arguments and instead either ask the user for input inside the BBSpider class (without the main function, just below the name="dmoz" line) or hard-code the values as static arguments.

My code is here.

I am trying to run a Scrapy spider from a Python script without any additional files (not even a settings file). That is why I have specified the settings inside the code itself.

This is the output I get when I run the script:

http://bigbasket.com/ps/?q=apple
2015-06-26 12:12:34 [scrapy] INFO: Scrapy 1.0.0 started (bot: scrapybot)
2015-06-26 12:12:34 [scrapy] INFO: Optional features available: ssl, http11
2015-06-26 12:12:34 [scrapy] INFO: Overridden settings: {}
2015-06-26 12:12:35 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
None
2015-06-26 12:12:35 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-06-26 12:12:35 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-06-26 12:12:35 [scrapy] INFO: Enabled item pipelines: 
2015-06-26 12:12:35 [scrapy] INFO: Spider opened
2015-06-26 12:12:35 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-06-26 12:12:35 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-06-26 12:12:35 [scrapy] ERROR: Error while obtaining start requests
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/scrapy/core/engine.py", line 110, in _next_request
    request = next(slot.start_requests)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spiders/__init__.py", line 70, in start_requests
    yield self.make_requests_from_url(url)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spiders/__init__.py", line 73, in make_requests_from_url
    return Request(url, dont_filter=True)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/http/request/__init__.py", line 24, in __init__
    self._set_url(url)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/http/request/__init__.py", line 57, in _set_url
    raise TypeError('Request url must be str or unicode, got %s:' % type(url).__name__)
TypeError: Request url must be str or unicode, got NoneType:
2015-06-26 12:12:35 [scrapy] INFO: Closing spider (finished)
2015-06-26 12:12:35 [scrapy] INFO: Dumping Scrapy stats:
{'finish_reason': 'finished',
 'finish_time': datetime.datetime(2015, 6, 26, 6, 42, 35, 342543),
 'log_count/DEBUG': 1,
 'log_count/ERROR': 1,
 'log_count/INFO': 7,
 'start_time': datetime.datetime(2015, 6, 26, 6, 42, 35, 339158)}
2015-06-26 12:12:35 [scrapy] INFO: Spider closed (finished)

The problems I am currently facing:

If you look closely at Line 1 and Line 6 of my output, the start_url that I passed to my spider is printed twice, even though I wrote the print statement only once, on Line 31 of my code (linked above). Why does that happen, and why with different values? The first print (Line 1 of my output) shows the correct URL, whereas the second (Line 6 of my output) shows None. Even a plain print 'hi' gets printed twice. Why is this happening?
Next, look at this line of my output: TypeError: Request url must be str or unicode, got NoneType: Why does this occur, even though I did the same thing as the questions linked above? I have no idea how to resolve it. I even tried `self.start_urls=[str(kwargs.get('start_url'))]`, which gives the following output:

http://bigbasket.com/ps/?q=apple
2015-06-26 12:28:00 [scrapy] INFO: Scrapy 1.0.0 started (bot: scrapybot)
2015-06-26 12:28:00 [scrapy] INFO: Optional features available: ssl, http11
2015-06-26 12:28:00 [scrapy] INFO: Overridden settings: {}
2015-06-26 12:28:00 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
None
2015-06-26 12:28:01 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-06-26 12:28:01 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-06-26 12:28:01 [scrapy] INFO: Enabled item pipelines: 
2015-06-26 12:28:01 [scrapy] INFO: Spider opened
2015-06-26 12:28:01 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-06-26 12:28:01 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-06-26 12:28:01 [scrapy] ERROR: Error while obtaining start requests
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/scrapy/core/engine.py", line 110, in _next_request
    request = next(slot.start_requests)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spiders/__init__.py", line 70, in start_requests
    yield self.make_requests_from_url(url)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spiders/__init__.py", line 73, in make_requests_from_url
    return Request(url, dont_filter=True)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/http/request/__init__.py", line 24, in __init__
    self._set_url(url)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/http/request/__init__.py", line 59, in _set_url
    raise ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url: None
2015-06-26 12:28:01 [scrapy] INFO: Closing spider (finished)
2015-06-26 12:28:01 [scrapy] INFO: Dumping Scrapy stats:
{'finish_reason': 'finished',
 'finish_time': datetime.datetime(2015, 6, 26, 6, 58, 1, 248350),
 'log_count/DEBUG': 1,
 'log_count/ERROR': 1,
 'log_count/INFO': 7,
 'start_time': datetime.datetime(2015, 6, 26, 6, 58, 1, 236056)}
2015-06-26 12:28:01 [scrapy] INFO: Spider closed (finished)

Please help me resolve the two problems above.

asked Jun 26 '15 07:06

Ashutosh Saboo

1 Answer

You need to pass the spider's arguments to the crawl method of CrawlerProcess, which forwards them to the spider's constructor. Pass the spider class, not an instance, and run it like this:

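from scrapy.crawler import CrawlerProcess
from scrapy.settings import Settings

crawler = CrawlerProcess(Settings())
crawler.crawl(BBSpider, start_url=url)  # keyword arguments are forwarded to BBSpider.__init__
crawler.start()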
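
The asker's script is not reproduced here, so the following is only a guess at what happened: the output in the question (the URL printed once, then None) would be consistent with the script building the spider itself and handing that instance to crawl, which then constructs a second spider without the arguments. The sketch below is a toy model of that behaviour in plain Python. ToySpider and ToyCrawlerProcess are hypothetical stand-ins written for illustration, not Scrapy classes, and the real Scrapy internals may differ between versions.

```python
# Toy model only: ToySpider and ToyCrawlerProcess are hypothetical stand-ins,
# not Scrapy classes. They mimic one behaviour: crawl() builds the spider itself.


class ToySpider(object):
    name = "dmoz"

    def __init__(self, *args, **kwargs):
        # Same pattern as in the question: read the URL from keyword arguments.
        self.start_urls = [kwargs.get("start_url")]
        print(self.start_urls[0])


class ToyCrawlerProcess(object):
    def crawl(self, spider_or_cls, *args, **kwargs):
        # Work from the class even if an instance was handed in, and build a
        # fresh spider from the arguments that were given to crawl().
        if isinstance(spider_or_cls, type):
            cls = spider_or_cls
        else:
            cls = type(spider_or_cls)
        return cls(*args, **kwargs)


url = "http://bigbasket.com/ps/?q=apple"
process = ToyCrawlerProcess()

# Passing an instance: __init__ runs twice, and the spider that is actually
# used was built without start_url, so its only start URL is None.
broken = process.crawl(ToySpider(start_url=url))

# Passing the class plus keyword arguments: __init__ runs once, with the URL.
working = process.crawl(ToySpider, start_url=url)
```

Under this model, the first call prints the URL and then None, which matches the two prints in the question, while the second call prints the URL once. It would also explain the ValueError variant: wrapping a missing value in str() produces the string 'None', which has no URL scheme.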

answered Nov 07 '22 10:11

eLRuLL
