The stockInfo.py spider contains:
import scrapy
import re
import pkgutil

class QuotesSpider(scrapy.Spider):
    name = "stockInfo"
    # Load the URL list bundled inside the "tutorial" package.
    data = pkgutil.get_data("tutorial", "resources/urls.txt")
    data = data.decode()
    start_urls = data.split("\r\n")

    def parse(self, response):
        # The 6-digit company code in the URL is used to name the output file.
        company = re.findall("[0-9]{6}", response.url)[0]
        filename = '%s_info.html' % company
        with open(filename, 'wb') as f:
            f.write(response.body)
To execute the stockInfo spider in Windows cmd:
d:
cd tutorial
scrapy crawl stockInfo
Every webpage listed in resources/urls.txt is then downloaded into the directory d:/tutorial.
Then I deploy the spider to Scrapinghub and run the stockInfo spider there.
No errors occur, but where are the downloaded webpages? How are the following lines executed on Scrapinghub?
with open(filename, 'wb') as f:
    f.write(response.body)
How can I save the data in Scrapinghub, and download it from Scrapinghub when the job is finished?
First, install scrapinghub:
pip install scrapinghub[msgpack]
I rewrote the spider as Thiago Curvelo suggests and deployed it to my Scrapinghub project.
Deploy log location: C:\Users\dreams\AppData\Local\Temp\shub_deploy_yzstvtj8.log
Error: Deploy failed: b'{"status": "error", "message": "Internal error"}'
_get_apisettings, commands_module='sh_scrapy.commands')
File "/usr/local/lib/python2.7/site-packages/sh_scrapy/crawl.py", line 148, in _run_usercode
_run(args, settings)
File "/usr/local/lib/python2.7/site-packages/sh_scrapy/crawl.py", line 103, in _run
_run_scrapy(args, settings)
File "/usr/local/lib/python2.7/site-packages/sh_scrapy/crawl.py", line 111, in _run_scrapy
execute(settings=settings)
File "/usr/local/lib/python2.7/site-packages/scrapy/cmdline.py", line 148, in execute
cmd.crawler_process = CrawlerProcess(settings)
File "/usr/local/lib/python2.7/site-packages/scrapy/crawler.py", line 243, in __init__
super(CrawlerProcess, self).__init__(settings)
File "/usr/local/lib/python2.7/site-packages/scrapy/crawler.py", line 134, in __init__
self.spider_loader = _get_spider_loader(settings)
File "/usr/local/lib/python2.7/site-packages/scrapy/crawler.py", line 330, in _get_spider_loader
return loader_cls.from_settings(settings.frozencopy())
File "/usr/local/lib/python2.7/site-packages/scrapy/spiderloader.py", line 61, in from_settings
return cls(settings)
File "/usr/local/lib/python2.7/site-packages/scrapy/spiderloader.py", line 25, in __init__
self._load_all_spiders()
File "/usr/local/lib/python2.7/site-packages/scrapy/spiderloader.py", line 47, in _load_all_spiders
for module in walk_modules(name):
File "/usr/local/lib/python2.7/site-packages/scrapy/utils/misc.py", line 71, in walk_modules
submod = import_module(fullpath)
File "/usr/local/lib/python2.7/importlib/__init__.py", line 37, in import_module
__import__(name)
File "/app/__main__.egg/mySpider/spiders/stockInfo.py", line 4, in <module>
ImportError: cannot import name ScrapinghubClient
{"message": "shub-image-info exit code: 1", "details": null, "error": "image_info_error"}
{"status": "error", "message": "Internal error"}
The requirements.txt contains only one line:
scrapinghub[msgpack]
The scrapinghub.yml contains:
project: 123456
requirements:
  file: requirements.tx
Now deploy it again:
D:\mySpider>shub deploy 123456
Packing version 1.0
Deploying to Scrapy Cloud project "123456"
Deploy log last 30 lines:
Deploy log location: C:\Users\dreams\AppData\Local\Temp\shub_deploy_4u7kb9ml.log
Error: Deploy failed: b'{"status": "error", "message": "Internal error"}'
File "/usr/local/lib/python2.7/site-packages/sh_scrapy/crawl.py", line 148, in _run_usercode
_run(args, settings)
File "/usr/local/lib/python2.7/site-packages/sh_scrapy/crawl.py", line 103, in _run
_run_scrapy(args, settings)
File "/usr/local/lib/python2.7/site-packages/sh_scrapy/crawl.py", line 111, in _run_scrapy
execute(settings=settings)
File "/usr/local/lib/python2.7/site-packages/scrapy/cmdline.py", line 148, in execute
cmd.crawler_process = CrawlerProcess(settings)
File "/usr/local/lib/python2.7/site-packages/scrapy/crawler.py", line 243, in __init__
super(CrawlerProcess, self).__init__(settings)
File "/usr/local/lib/python2.7/site-packages/scrapy/crawler.py", line 134, in __init__
self.spider_loader = _get_spider_loader(settings)
File "/usr/local/lib/python2.7/site-packages/scrapy/crawler.py", line 330, in _get_spider_loader
return loader_cls.from_settings(settings.frozencopy())
File "/usr/local/lib/python2.7/site-packages/scrapy/spiderloader.py", line 61, in from_settings
return cls(settings)
File "/usr/local/lib/python2.7/site-packages/scrapy/spiderloader.py", line 25, in __init__
self._load_all_spiders()
File "/usr/local/lib/python2.7/site-packages/scrapy/spiderloader.py", line 47, in _load_all_spiders
for module in walk_modules(name):
File "/usr/local/lib/python2.7/site-packages/scrapy/utils/misc.py", line 71, in walk_modules
submod = import_module(fullpath)
File "/usr/local/lib/python2.7/importlib/__init__.py", line 37, in import_module
__import__(name)
File "/tmp/unpacked-eggs/__main__.egg/mySpider/spiders/stockInfo.py", line 5, in <module>
from scrapinghub import ScrapinghubClient
ImportError: cannot import name ScrapinghubClient
{"message": "shub-image-info exit code: 1", "details": null, "error": "image_info_error"}
{"status": "error", "message": "Internal error"}
1. The issue remains:
ImportError: cannot import name ScrapinghubClient
2. Only Python 3.7 and Windows 7 are installed on my local PC, so why does the error mention:
File "/usr/local/lib/python2.7/site-packages/scrapy/utils/misc.py", line 71, in walk_modules
Is the error produced on Scrapinghub (the remote end) and only sent back to my local end for display?
Writing data to disk in a cloud environment isn't reliable these days, since everybody is using containers and they are ephemeral.
You can, however, save your data using Scrapinghub's Collections API. You can use it directly through its endpoints or through this wrapper: https://python-scrapinghub.readthedocs.io/en/latest/
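If you would rather hit the endpoints directly, here is a minimal sketch with requests. It is an assumption on my part that the collections endpoint lives at storage.scrapinghub.com and that the API key is sent as the basic-auth username; the project id, API key and store name "mystuff" are placeholders:
import json
import requests

project_id = '12345'
apikey = 'XXXX'
# Assumed Collections endpoint; the store name "mystuff" is a placeholder.
url = 'https://storage.scrapinghub.com/collections/%s/s/mystuff' % project_id

# Write one item; '_key' identifies it inside the collection.
item = {'_key': '600001', 'body': '<html>...</html>'}
resp = requests.post(url, auth=(apikey, ''), data=json.dumps(item))
resp.raise_for_status()

# Read that item back by its key.
resp = requests.get('%s/600001' % url, auth=(apikey, ''))
print(resp.text)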
With python-scrapinghub, your code would look like this:
from scrapinghub import ScrapinghubClient
from contextlib import closing

project_id = '12345'
apikey = 'XXXX'
client = ScrapinghubClient(apikey)
store = client.get_project(project_id).collections.get_store('mystuff')

# ...

def parse(self, response):
    company = re.findall("[0-9]{6}", response.url)[0]
    with closing(store.create_writer()) as writer:
        writer.write({
            '_key': company,
            'body': response.body})
After saving something into a collection, a link to it will appear in your dashboard.
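To pull the saved pages back down once a job has finished (the second half of your question), you can read the same collection from your local machine with python-scrapinghub. A minimal sketch, assuming the same placeholder project id, API key and store name as above, and assuming the stored body comes back as text:
from scrapinghub import ScrapinghubClient

project_id = '12345'   # placeholder
apikey = 'XXXX'        # placeholder

client = ScrapinghubClient(apikey)
store = client.get_project(project_id).collections.get_store('mystuff')

# Fetch one saved page by the key it was written under (the 6-digit company code) ...
page = store.get('600001')   # e.g. {'body': '<html>...'}

# ... or walk the whole collection and dump every page to disk.
for item in store.iter():    # each item carries its '_key' plus the stored fields
    with open('%s_info.html' % item['_key'], 'w') as f:
        f.write(item['body'])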
EDIT:
To make sure the dependencies (scrapinghub[msgpack]) will be installed in the cloud, add them to your requirements.txt or Pipfile and include it in the scrapinghub.yml file. E.g.:
# project_directory/scrapinghub.yml
projects:
  default: 12345
stacks:
  default: scrapy:1.5-py3
requirements:
  file: requirements.txt
(https://shub.readthedocs.io/en/stable/deploying.html#deploying-dependencies)
Thus, scrapinghub (the cloud service) will install scrapinghub (the python library). :)
I hope it helps you.