
How to save downloaded file when running spider on Scrapinghub?

Tags: python, scrapy

The spider stockInfo.py contains:

import scrapy
import re
import pkgutil

class QuotesSpider(scrapy.Spider):
    name = "stockInfo"
    # Read the URL list bundled inside the "tutorial" package.
    data = pkgutil.get_data("tutorial", "resources/urls.txt")
    data = data.decode()
    start_urls = data.split("\r\n")

    def parse(self, response):
        # The 6-digit company code is extracted from the URL.
        company = re.findall("[0-9]{6}", response.url)[0]
        filename = '%s_info.html' % company
        with open(filename, 'wb') as f:
            f.write(response.body)

To execute the stockInfo spider from the Windows command prompt:

d:
cd  tutorial
scrapy crawl stockInfo

Now every webpage from the URLs in resources/urls.txt is downloaded into the directory d:/tutorial.

Then I deploy the spider to Scrapinghub and run the stockInfo spider there.


No error occurs, but where are the downloaded webpages?
How are the following lines executed on Scrapinghub?

        with open(filename, 'wb') as f:
            f.write(response.body)

How can I save the data on Scrapinghub, and download it from Scrapinghub once the job is finished?

First, install scrapinghub:

pip install scrapinghub[msgpack]

I rewrote the spider as Thiago Curvelo suggested and deployed it to my Scrapinghub project.

Deploy log location: C:\Users\dreams\AppData\Local\Temp\shub_deploy_yzstvtj8.log
Error: Deploy failed: b'{"status": "error", "message": "Internal error"}'
    _get_apisettings, commands_module='sh_scrapy.commands')
  File "/usr/local/lib/python2.7/site-packages/sh_scrapy/crawl.py", line 148, in _run_usercode
    _run(args, settings)
  File "/usr/local/lib/python2.7/site-packages/sh_scrapy/crawl.py", line 103, in _run
    _run_scrapy(args, settings)
  File "/usr/local/lib/python2.7/site-packages/sh_scrapy/crawl.py", line 111, in _run_scrapy
    execute(settings=settings)
  File "/usr/local/lib/python2.7/site-packages/scrapy/cmdline.py", line 148, in execute
    cmd.crawler_process = CrawlerProcess(settings)
  File "/usr/local/lib/python2.7/site-packages/scrapy/crawler.py", line 243, in __init__
    super(CrawlerProcess, self).__init__(settings)
  File "/usr/local/lib/python2.7/site-packages/scrapy/crawler.py", line 134, in __init__
    self.spider_loader = _get_spider_loader(settings)
  File "/usr/local/lib/python2.7/site-packages/scrapy/crawler.py", line 330, in _get_spider_loader
    return loader_cls.from_settings(settings.frozencopy())
  File "/usr/local/lib/python2.7/site-packages/scrapy/spiderloader.py", line 61, in from_settings
    return cls(settings)
  File "/usr/local/lib/python2.7/site-packages/scrapy/spiderloader.py", line 25, in __init__
    self._load_all_spiders()
  File "/usr/local/lib/python2.7/site-packages/scrapy/spiderloader.py", line 47, in _load_all_spiders
    for module in walk_modules(name):
  File "/usr/local/lib/python2.7/site-packages/scrapy/utils/misc.py", line 71, in walk_modules
    submod = import_module(fullpath)
  File "/usr/local/lib/python2.7/importlib/__init__.py", line 37, in import_module
    __import__(name)
  File "/app/__main__.egg/mySpider/spiders/stockInfo.py", line 4, in <module>
ImportError: cannot import name ScrapinghubClient
{"message": "shub-image-info exit code: 1", "details": null, "error": "image_info_error"}
{"status": "error", "message": "Internal error"}

The requirements.txt contains only one line:

scrapinghub[msgpack]

The scrapinghub.yml contains:

project: 123456
requirements:
  file: requirements.txt

Now I deploy it:

D:\mySpider>shub deploy 123456
Packing version 1.0
Deploying to Scrapy Cloud project "123456"
Deploy log last 30 lines:

Deploy log location: C:\Users\dreams\AppData\Local\Temp\shub_deploy_4u7kb9ml.log
Error: Deploy failed: b'{"status": "error", "message": "Internal error"}'
  File "/usr/local/lib/python2.7/site-packages/sh_scrapy/crawl.py", line 148, in _run_usercode
    _run(args, settings)
  File "/usr/local/lib/python2.7/site-packages/sh_scrapy/crawl.py", line 103, in _run
    _run_scrapy(args, settings)
  File "/usr/local/lib/python2.7/site-packages/sh_scrapy/crawl.py", line 111, in _run_scrapy
    execute(settings=settings)
  File "/usr/local/lib/python2.7/site-packages/scrapy/cmdline.py", line 148, in execute
    cmd.crawler_process = CrawlerProcess(settings)
  File "/usr/local/lib/python2.7/site-packages/scrapy/crawler.py", line 243, in __init__
    super(CrawlerProcess, self).__init__(settings)
  File "/usr/local/lib/python2.7/site-packages/scrapy/crawler.py", line 134, in __init__
    self.spider_loader = _get_spider_loader(settings)
  File "/usr/local/lib/python2.7/site-packages/scrapy/crawler.py", line 330, in _get_spider_loader
    return loader_cls.from_settings(settings.frozencopy())
  File "/usr/local/lib/python2.7/site-packages/scrapy/spiderloader.py", line 61, in from_settings
    return cls(settings)
  File "/usr/local/lib/python2.7/site-packages/scrapy/spiderloader.py", line 25, in __init__
    self._load_all_spiders()
  File "/usr/local/lib/python2.7/site-packages/scrapy/spiderloader.py", line 47, in _load_all_spiders
    for module in walk_modules(name):
  File "/usr/local/lib/python2.7/site-packages/scrapy/utils/misc.py", line 71, in walk_modules
    submod = import_module(fullpath)
  File "/usr/local/lib/python2.7/importlib/__init__.py", line 37, in import_module
    __import__(name)
  File "/tmp/unpacked-eggs/__main__.egg/mySpider/spiders/stockInfo.py", line 5, in <module>
    from scrapinghub import ScrapinghubClient
ImportError: cannot import name ScrapinghubClient
{"message": "shub-image-info exit code: 1", "details": null, "error": "image_info_error"}
{"status": "error", "message": "Internal error"}     

1. The issue remains:

ImportError: cannot import name ScrapinghubClient

2. Only Python 3.7 and Windows 7 are installed on my local PC, so why does the error information say:

File "/usr/local/lib/python2.7/site-packages/scrapy/utils/misc.py", line 71, in walk_modules

Is this error information generated on the Scrapinghub (remote) side and just sent back to my local machine for display?

asked Mar 16 '19 by showkey




1 Answer

Writing data to disk in a cloud environment isn't reliable these days, since everything runs in containers and containers are ephemeral.

But you could save your data using Scrapinghub's Collections API, either directly through its HTTP endpoints or through this wrapper: https://python-scrapinghub.readthedocs.io/en/latest/
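
For the direct-endpoint route, a minimal sketch with the requests library could look like this; the project id, API key and collection name are placeholders, and the endpoint layout (storage.scrapinghub.com/collections/&lt;project&gt;/s/&lt;collection&gt;, with the API key as the basic-auth username) is my assumption based on the Collections API docs:

import requests

project_id = "12345"     # placeholder project id
apikey = "XXXX"          # placeholder Scrapinghub API key
collection = "mystuff"   # placeholder collection (store) name

# Assumed endpoint for a regular "store"-type collection.
url = "https://storage.scrapinghub.com/collections/%s/s/%s" % (project_id, collection)

item = {"_key": "000001", "body": "<html>...</html>"}

# The API key goes in as the basic-auth username, with an empty password.
resp = requests.post(url, json=item, auth=(apikey, ""))
resp.raise_for_status()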

With python-scrapinghub, your code would look like this:

from scrapinghub import ScrapinghubClient
from contextlib import closing

project_id = '12345'
apikey = 'XXXX'
client = ScrapinghubClient(apikey)
# A collection "store" keeps key/value items for the project.
store = client.get_project(project_id).collections.get_store('mystuff')

# ...

    def parse(self, response):
        company = re.findall("[0-9]{6}", response.url)[0]
        # Write the page into the collection instead of the local disk.
        with closing(store.create_writer()) as writer:
            writer.write({
                '_key': company,
                'body': response.body}
            )

After saving something into a collection, a link to it will appear in your project dashboard under Collections.
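
And to pull the saved pages back onto your own machine once the job has finished, you can read the same collection from a local script with python-scrapinghub. A minimal sketch, assuming the same placeholder project id, API key and store name as above, and that each item comes back with the '_key' and 'body' fields the spider wrote:

from scrapinghub import ScrapinghubClient

project_id = '12345'   # placeholder
apikey = 'XXXX'        # placeholder

client = ScrapinghubClient(apikey)
store = client.get_project(project_id).collections.get_store('mystuff')

# Iterate over every item in the collection and dump each page to disk.
for item in store.iter():
    body = item['body']
    if isinstance(body, bytes):
        # msgpack may hand the stored response body back as raw bytes.
        body = body.decode('utf-8', errors='replace')
    filename = '%s_info.html' % item['_key']
    with open(filename, 'w', encoding='utf-8') as f:
        f.write(body)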

EDIT:

To make sure the dependencies (scrapinghub[msgpack]) will be installed in the cloud, add them to your requirements.txt or Pipfile and reference that file in your scrapinghub.yml, e.g.:

# project_directory/scrapinghub.yml

projects:
  default: 12345

stacks:
  default: scrapy:1.5-py3

requirements:
  file: requirements.txt

(https://shub.readthedocs.io/en/stable/deploying.html#deploying-dependencies)

Thus, scrapinghub (the cloud service) will install scrapinghub (the python library). :)

I hope this helps.

answered Sep 29 '22 by Thiago Curvelo