I am new to Scrapy and my task is simple:
For a given e-commerce website:
- crawl all website pages
- look for product pages
- if the URL points to a product page:
  - create an Item
  - process the item to store it in a database
I created the spider but products are just printed in a simple file.
My question is about the project structure: how do I use items in a spider, and how do I send items to pipelines?
I can't find a simple example of a project using items and pipelines.
The syntax is a little different from when you are not using Items: the newly defined Item class needs to be imported into the spider module. The spider then instantiates an item object from it, populates each field, and yields the populated item object.
The CrawlerProcess class runs multiple Scrapy spiders simultaneously in a single process. Create an instance of CrawlerProcess with the project settings; if a spider needs its own custom settings, create a Crawler instance for that spider.
Well, the main purpose of items is to store the data you crawled. A scrapy.Item behaves much like a dictionary. To declare your items, create a class that subclasses scrapy.Item and add a scrapy.Field attribute for each piece of data:
    import scrapy

    class Product(scrapy.Item):
        url = scrapy.Field()
        title = scrapy.Field()
You can now use it in your spider by importing your Product class.
For more advanced information, see the Items page of the Scrapy documentation.
For pipelines, you first need to tell Scrapy to enable your custom pipeline.
In the settings.py file:
    # The number sets the run order (0-1000); lower values run first
    ITEM_PIPELINES = {
        'myproject.pipelines.CustomPipeline': 300,
    }
You can now write your pipeline and play with your item.
In the pipelines.py file:
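    from scrapy.exceptions import DropItem

    class CustomPipeline:
        def __init__(self):
            # Create your database connection here
            self.connection = None

        def process_item(self, item, spider):
            # Raise DropItem to discard an item you do not want to store
            if not item.get('title'):
                raise DropItem('Missing title in %s' % item)
            # Here you can index or store your item
            return item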
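Since the question is about storing items in a database, here is one possible way to fill in that skeleton. This is only a sketch: it assumes SQLite through the standard sqlite3 module, and the class name SQLitePipeline, the products.db path, and the products table are hypothetical names chosen for the example. It uses the open_spider and close_spider hooks, which Scrapy calls on a pipeline when the spider starts and finishes:

```python
import sqlite3


class SQLitePipeline:
    """Store each scraped item in a SQLite table (illustrative sketch)."""

    def __init__(self, db_path="products.db"):
        # Hypothetical default path; ":memory:" is handy for experiments.
        self.db_path = db_path
        self.conn = None

    def open_spider(self, spider):
        # Called once when the spider starts: connect and create the table.
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS products (url TEXT PRIMARY KEY, title TEXT)"
        )

    def close_spider(self, spider):
        # Called once when the spider finishes: flush and release the connection.
        self.conn.commit()
        self.conn.close()

    def process_item(self, item, spider):
        # Item objects and plain dicts both support key access.
        self.conn.execute(
            "INSERT OR REPLACE INTO products (url, title) VALUES (?, ?)",
            (item["url"], item["title"]),
        )
        self.conn.commit()
        return item  # hand the item on to any later pipeline
```

To enable it, you would list it in ITEM_PIPELINES in the same way as CustomPipeline. Using the URL as the primary key with INSERT OR REPLACE means a product crawled twice is updated rather than duplicated; whether that is the right choice depends on your data.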
Finally, in your spider, you need to yield your item once it is filled.
spider.py example:
    import scrapy
    from myproject.items import Product

    class MySpider(scrapy.Spider):
        name = "test"
        start_urls = ['http://www.exemple.com']

        def parse(self, response):
            doc = Product()
            doc['url'] = response.url
            # .get() returns the first match as a string (or None if nothing matches)
            doc['title'] = response.xpath('//div/p/text()').get()
            yield doc  # Will go to your pipeline
Hope this helps. For more details on pipelines, see the Item Pipeline page of the Scrapy documentation.