On a Mac, I have Jupyter installed. When I run jupyter notebook
from the root folder of my Scrapy project, it opens the notebook, and I can browse all of the project files at this point.
How do I execute the project from the notebook?
If I click the Running tab, under Terminals, I see:
There are no terminals running.
Scrapy is an open-source framework for extracting data from websites. It is fast, simple, and extensible. Every data scientist should have some familiarity with it, as they often need to gather data this way.
Using the scrapy tool
You can start by running the Scrapy tool with no arguments, and it will print some usage help and the available commands:

Scrapy X.Y - no active project

Usage:
  scrapy <command> [options] [args]

Available commands:
  crawl         Run a spider
  fetch         Fetch a URL using the Scrapy downloader
[...]
Basic Script
The key to running Scrapy in a Python script is the CrawlerProcess class. This class lives in the scrapy.crawler module, and it provides the engine to run Scrapy within a Python script. Internally, the CrawlerProcess class imports Python's Twisted framework.
To begin the project, we can run the scrapy startproject command followed by the name we will give the project. The target website is located at https://books.toscrape.com. We can open the project in PyCharm, and the project folder structure should look familiar to you at this point.
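As a sketch of that step (the project name books is chosen here purely for illustration), the command and the layout Scrapy generates look like this:

```shell
scrapy startproject books
# This creates the following structure:
# books/
#     scrapy.cfg          # deploy configuration file
#     books/              # the project's Python module
#         __init__.py
#         items.py        # item definitions
#         middlewares.py  # project middlewares
#         pipelines.py    # item pipelines
#         settings.py     # project settings
#         spiders/        # directory where your spiders live
#             __init__.py
```

Your spider files go in the spiders/ directory, and scrapy crawl looks them up by the name attribute each spider defines.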
There are two main ways to achieve that:
1.
Under the Files tab, open a new terminal: New > Terminal
Then simply run your spider: scrapy crawl [options] <spider>
2.
Create a new notebook and use the CrawlerProcess or CrawlerRunner class to run the spider in a cell:
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
process = CrawlerProcess(get_project_settings())
process.crawl('your-spider')  # the name attribute of your spider class
process.start() # the script will block here until the crawling is finished
Scrapy docs - Run Scrapy from a script
There is no need for a terminal to run the spider class. Just add the following code in a Jupyter notebook cell:
import scrapy
from scrapy.crawler import CrawlerProcess
class MySpider(scrapy.Spider):
# Your spider definition
...
process = CrawlerProcess({
'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})
process.crawl(MySpider)
process.start() # the script will block here until the crawling is finished
For more information, see the Scrapy documentation on running Scrapy from a script.