
How to run Scrapy project in Jupyter?

On a Mac, I have Jupyter installed and when I type jupyter notebook from the root folder of my Scrapy project, it opens the notebook. I can browse all of the project files at this point.

How do I execute the project from the notebook?

If I click the Running tab, under Terminals, I see:

There are no terminals running.
4thSpace asked Nov 29 '16 02:11



2 Answers

There are two main ways to achieve that:

1. Under the Files tab open a new terminal: New > Terminal
Then simply run your spider: scrapy crawl [options] <spider>

2. Create a new notebook and use the CrawlerProcess or CrawlerRunner class to run the spider in a cell:

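from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Load the project's settings; this works when the notebook runs inside the Scrapy project folder
process = CrawlerProcess(get_project_settings())

# 'your-spider' is a placeholder for the spider's name attribute
process.crawl('your-spider')
process.start()  # the script will block here until the crawling is finished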

Scrapy docs - Run Scrapy from a script
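
If you would rather not open a terminal at all, the command from option 1 can also be launched from a notebook cell with the standard library's subprocess module. This is only a minimal sketch, not something from the Scrapy docs: build_crawl_command and run_spider are hypothetical helper names, 'your-spider' and items.json are placeholders, and it assumes Scrapy is installed in the same environment as the notebook kernel.

```python
import subprocess
import sys


def build_crawl_command(spider_name, output_file=None):
    """Return the argument list equivalent to `scrapy crawl <spider>`."""
    # `python -m scrapy` runs Scrapy with the same interpreter as the notebook kernel.
    command = [sys.executable, "-m", "scrapy", "crawl", spider_name]
    if output_file is not None:
        # -o appends the scraped items to the given feed file.
        command += ["-o", output_file]
    return command


def run_spider(spider_name, project_dir=".", output_file=None):
    """Run the crawl in a child process and return the CompletedProcess."""
    return subprocess.run(
        build_crawl_command(spider_name, output_file),
        cwd=project_dir,       # should be a folder inside the Scrapy project
        capture_output=True,   # collect the output instead of printing it live
        text=True,
    )


# Example call from a notebook cell (placeholder spider name):
# result = run_spider("your-spider", output_file="items.json")
# print(result.returncode)
# print(result.stderr[-2000:])  # Scrapy logs to stderr by default
```

Because each crawl runs in its own process, the cell can be re-run as often as you like. With CrawlerProcess inside the kernel, a second process.start() fails because the Twisted reactor cannot be restarted.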

Paulo Romeira answered Sep 24 '22 23:09


There is no need for a terminal to run a spider class. Just add the following code to a Jupyter notebook cell:

import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    # Your spider definition
    ...

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

process.crawl(MySpider)
process.start() # the script will block here until the crawling is finished

For more information see here

susan097 answered Sep 21 '22 23:09