Scrapy Very Basic Example

Hi, I have Scrapy installed on my Mac and I was trying to follow the very first example on their website.

They were trying to run the command:

scrapy crawl mininova.org -o scraped_data.json -t json

I don't quite understand what this means. It looks like scrapy is a separate program, and I don't think it has a command called crawl. The example includes a block of code that defines the classes MininovaSpider and TorrentItem. I don't know where these two classes should go. Do they go in the same file, and what should that Python file be called?

asked Sep 16 '13 22:09

B.Mr.W.

1 Answer

TL;DR: see Self-contained minimum example script to run scrapy.

First of all, having a normal Scrapy project with a separate .cfg, settings.py, pipelines.py, items.py, spiders package, etc. is the recommended way to keep and handle your web-scraping logic. It provides modularity and separation of concerns, which keeps things organized, clear and testable.

If you are following the official Scrapy tutorial to create a project, you run the crawl through the scrapy command-line tool:

scrapy crawl myspider

But Scrapy also provides an API to run a crawl from a script.

There are several key concepts that should be mentioned:

Settings class - basically a key-value "container" which is initialized with default built-in values
Crawler class - the main class that acts as the glue for all the different components involved in web-scraping with Scrapy
Twisted reactor - since Scrapy is built on top of the Twisted asynchronous networking library, to start a crawler we need to put it inside the Twisted reactor, which is, in simple words, an event loop:

The reactor is the core of the event loop within Twisted – the loop which drives applications using Twisted. The event loop is a programming construct that waits for and dispatches events or messages in a program. It works by calling some internal or external “event provider”, which generally blocks until an event has arrived, and then calls the relevant event handler (“dispatches the event”). The reactor provides basic interfaces to a number of services, including network communications, threading, and event dispatching.
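
To make the event-loop idea concrete, here is a toy sketch in plain Python. It is not Twisted's reactor and shares none of its API: `ToyReactor`, `on` and `emit` are hypothetical names made up for this illustration. It only shows the "queue events, dispatch to handlers, run until stopped" pattern. Unlike the real reactor, which keeps waiting for new events, this toy also exits when its queue runs dry.

```python
from collections import deque


class ToyReactor:
    """Toy event loop: NOT Twisted's reactor, only the dispatch idea."""

    def __init__(self):
        self._events = deque()   # pending (name, payload) events
        self._handlers = {}      # event name -> list of callbacks
        self._running = False

    def on(self, name, handler):
        # Register a handler for an event name.
        self._handlers.setdefault(name, []).append(handler)

    def emit(self, name, payload=None):
        # Queue an event; it is dispatched later, inside run().
        self._events.append((name, payload))

    def stop(self):
        self._running = False

    def run(self):
        # Blocks the caller until stop() is called or no events remain.
        self._running = True
        while self._running and self._events:
            name, payload = self._events.popleft()
            for handler in self._handlers.get(name, []):
                handler(payload)


seen = []
reactor = ToyReactor()
reactor.on("response", lambda body: seen.append(body))
reactor.on("spider_closed", lambda _: reactor.stop())

reactor.emit("response", "page 1")
reactor.emit("spider_closed")
reactor.emit("response", "never dispatched")  # queued after the stop event
reactor.run()
print(seen)  # ['page 1']
```

The handler attached to `spider_closed` is what ends the loop, which is the same reason the real script has to call `reactor.stop()` from a signal handler.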

Here is a basic, simplified process for running Scrapy from a script:

create a Settings instance (or use get_project_settings() to use existing settings):

```
settings = Settings()  # or settings = get_project_settings() 
```
instantiate Crawler with settings instance passed in:

```
crawler = Crawler(settings) 
```
instantiate a spider (this is what it is all about eventually, right?):

```
spider = MySpider() 
```
configure signals. This is an important step if you want to do post-processing, collect stats or, at the very least, have the crawl ever finish, since the Twisted reactor needs to be stopped manually. The Scrapy docs suggest stopping the reactor in the spider_closed signal handler:

Note that you will also have to shutdown the Twisted reactor yourself after the spider is finished. This can be achieved by connecting a handler to the signals.spider_closed signal.

```
def callback(spider, reason):
    stats = spider.crawler.stats.get_stats()
    # stats here is a dictionary of crawling stats that you usually see on the console

    # here we need to stop the reactor
    reactor.stop()

crawler.signals.connect(callback, signal=signals.spider_closed)
```

configure and start crawler instance with a spider passed in:

```
crawler.configure()
crawler.crawl(spider)
crawler.start()
```
optionally start logging:

```
log.start() 
```
start the reactor - this would block the script execution:

```
reactor.run() 
```

Here is an example of a self-contained script that uses a DmozSpider spider and involves item loaders with input and output processors, as well as item pipelines:

```
import json

from scrapy.crawler import Crawler
from scrapy.contrib.loader import ItemLoader
from scrapy.contrib.loader.processor import Join, MapCompose, TakeFirst
from scrapy import log, signals, Spider, Item, Field
from scrapy.settings import Settings
from twisted.internet import reactor


# define an item class
class DmozItem(Item):
    title = Field()
    link = Field()
    desc = Field()


# define an item loader with input and output processors
class DmozItemLoader(ItemLoader):
    default_input_processor = MapCompose(unicode.strip)
    default_output_processor = TakeFirst()

    desc_out = Join()


# define a pipeline
class JsonWriterPipeline(object):
    def __init__(self):
        self.file = open('items.jl', 'wb')

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item


# define a spider
class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        for sel in response.xpath('//ul/li'):
            loader = DmozItemLoader(DmozItem(), selector=sel, response=response)
            loader.add_xpath('title', 'a/text()')
            loader.add_xpath('link', 'a/@href')
            loader.add_xpath('desc', 'text()')
            yield loader.load_item()


# callback fired when the spider is closed
def callback(spider, reason):
    stats = spider.crawler.stats.get_stats()  # collect/log stats?

    # stop the reactor
    reactor.stop()


# instantiate settings and provide a custom configuration
settings = Settings()
settings.set('ITEM_PIPELINES', {
    '__main__.JsonWriterPipeline': 100
})

# instantiate a crawler passing in settings
crawler = Crawler(settings)

# instantiate a spider
spider = DmozSpider()

# configure signals
crawler.signals.connect(callback, signal=signals.spider_closed)

# configure and start the crawler
crawler.configure()
crawler.crawl(spider)
crawler.start()

# start logging
log.start()

# start the reactor (blocks execution)
reactor.run()
```

Run it the usual way:

python runner.py

and observe items exported to items.jl with the help of the pipeline:

```
{"desc": "", "link": "/", "title": "Top"}
{"link": "/Computers/", "title": "Computers"}
{"link": "/Computers/Programming/", "title": "Programming"}
{"link": "/Computers/Programming/Languages/", "title": "Languages"}
{"link": "/Computers/Programming/Languages/Python/", "title": "Python"}
...
```
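
The exported file holds one JSON object per line (the JSON Lines format), so it can be read back with the standard library alone. This is a minimal sketch: `read_json_lines` is a hypothetical helper name, and the sample text is two lines copied from the output above, standing in for `open("items.jl")`.

```python
import json
from io import StringIO


def read_json_lines(stream):
    """Yield one dict per non-empty line of a JSON Lines stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


# Two lines copied from the output above, standing in for open("items.jl").
sample = StringIO(
    '{"desc": "", "link": "/", "title": "Top"}\n'
    '{"link": "/Computers/", "title": "Computers"}\n'
)
items = list(read_json_lines(sample))
print([item["title"] for item in items])  # ['Top', 'Computers']
```

Note that not every item has every field: the second object has no `desc` key, so downstream code should use `item.get("desc")` rather than indexing.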

Gist is available here (feel free to improve):

Self-contained minimum example script to run scrapy
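
One detail worth tightening in the script: its JsonWriterPipeline opens items.jl in the constructor and never closes it. Scrapy item pipelines are plain classes, and besides `process_item` they may define the optional `open_spider` and `close_spider` hooks. The sketch below assumes Python 3 and adds a `path` constructor argument that is not in the original. Because a pipeline is just a class, it can be exercised by hand without Scrapy installed.

```python
import json
import os
import tempfile


class JsonWriterPipeline:
    """Writes each item as one JSON object per line (JSON Lines)."""

    def __init__(self, path="items.jl"):
        # `path` is an addition for this sketch; the original hard-codes the name.
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called by Scrapy when the spider is opened.
        self.file = open(self.path, "w", encoding="utf-8")

    def close_spider(self, spider):
        # Called by Scrapy when the spider is closed; flushes and closes the file.
        self.file.close()

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item)) + "\n")
        return item


# Exercise the pipeline by hand, outside Scrapy (the spider argument is unused).
path = os.path.join(tempfile.mkdtemp(), "items.jl")
pipeline = JsonWriterPipeline(path)
pipeline.open_spider(spider=None)
pipeline.process_item(
    {"title": "Python", "link": "/Computers/Programming/Languages/Python/"},
    spider=None,
)
pipeline.close_spider(spider=None)

with open(path, encoding="utf-8") as fh:
    print(fh.read())
```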

Notes:

If you define settings by instantiating a Settings() object, you'll get all of Scrapy's default settings. But if you want to, for example, configure an existing pipeline, set a DEPTH_LIMIT or tweak any other setting, you need to either set it in the script via settings.set() (as demonstrated in the example):

```
pipelines = {
    'mypackage.pipelines.FilterPipeline': 100,
    'mypackage.pipelines.MySQLPipeline': 200
}
settings.set('ITEM_PIPELINES', pipelines, priority='cmdline')
```

or, use an existing settings.py with all the custom settings preconfigured:

```
from scrapy.utils.project import get_project_settings

settings = get_project_settings()
```
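
The `priority` argument to `settings.set()` decides which value wins when the same setting is written more than once. The toy model below illustrates that rule in plain Python. `ToySettings` is a hypothetical stand-in, not Scrapy's `Settings` class, and the priority names and numbers are the ones listed in Scrapy's settings documentation as I understand them, so treat them as an assumption and check your installed version.

```python
# Priority names and numbers as listed in Scrapy's settings documentation
# (an assumption for this sketch; verify against your installed version).
PRIORITIES = {"default": 0, "command": 10, "project": 20, "spider": 30, "cmdline": 40}


class ToySettings:
    """Toy stand-in for scrapy.settings.Settings: the highest priority wins."""

    def __init__(self, defaults):
        # Every built-in default starts at the lowest priority.
        self._values = {k: (v, PRIORITIES["default"]) for k, v in defaults.items()}

    def set(self, name, value, priority="project"):
        new = PRIORITIES[priority]
        _, current = self._values.get(name, (None, -1))
        if new >= current:  # a lower-priority write is silently ignored
            self._values[name] = (value, new)

    def get(self, name):
        return self._values[name][0]


settings = ToySettings({"DEPTH_LIMIT": 0})
settings.set("DEPTH_LIMIT", 3, priority="cmdline")
settings.set("DEPTH_LIMIT", 1, priority="project")  # ignored: project < cmdline
print(settings.get("DEPTH_LIMIT"))  # 3
```

This is why passing `priority='cmdline'` in a script is a way to make sure a value is not overridden by project-level or spider-level settings.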

alecxe
