I am new to <code>scrapy</code> and my task is simple. For a given e-commerce website: <ul> <li>Crawl all pages of the website</li> <li>Look for product pages</li> <li>If a URL points to a product page, create an Item</li> <li>Process the item to store it in a database</li> </ul> I created the spider, but the products are just printed to a plain file. My question is about the project structure: how do I use items in the spider, and how do I send items to pipelines? I can't find a simple example of a project that uses items and pipelines.


Scrapy: how to use items in spider and how to send items to pipelines?

1 Answer

How to use items in my spider?

The main purpose of items is to store the data you crawled. scrapy.Item objects behave much like dictionaries, except that only declared fields can be set. To declare an item, create a class that subclasses scrapy.Item and add one scrapy.Field per attribute:

import scrapy

class Product(scrapy.Item):
    url = scrapy.Field()
    title = scrapy.Field()

You can now use it in your spider by importing your Product class.

For more advanced usage, see the Items section of the Scrapy documentation.

How to send items to the pipeline ?

First, you need to tell Scrapy to use your custom pipeline by enabling it in the project settings.

In the settings.py file:

ITEM_PIPELINES = {
    'myproject.pipelines.CustomPipeline': 300,
}
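The number attached to each pipeline sets its order: items pass through pipelines from the lowest value to the highest, and values are conventionally picked in the 0 to 1000 range. As a sketch, assuming a hypothetical second pipeline named ValidateProductPipeline that should run before the database one:

```python
# settings.py (sketch): ValidateProductPipeline is a hypothetical second pipeline.
ITEM_PIPELINES = {
    'myproject.pipelines.ValidateProductPipeline': 100,  # lower value, runs first
    'myproject.pipelines.CustomPipeline': 300,           # higher value, runs second
}

# Not part of settings.py: only illustrates the order implied by the numbers.
execution_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
print(execution_order)
```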

You can now write your pipeline and play with your item.

In the pipelines.py file (the module referenced in ITEM_PIPELINES above):

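from scrapy.exceptions import DropItem

class CustomPipeline:
    def __init__(self):
        # Create your database connection here
        self.connection = None

    def process_item(self, item, spider):
        # Index or store your item here; raise DropItem to discard it
        if not item.get('url'):
            raise DropItem('Missing url')
        return item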
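Since the goal in the question is to store items in a database, here is one possible sketch of a pipeline backed by SQLite from the standard library. The class name SQLitePipeline, the products table, and the db_path parameter are assumptions for illustration, not part of the original answer. A pipeline is a plain Python class, so it can be exercised outside Scrapy, as the last lines do with an in-memory database and a plain dict standing in for the item.

```python
import sqlite3


class SQLitePipeline:
    """Stores each item in a SQLite table (hypothetical schema)."""

    def __init__(self, db_path="products.db"):
        self.db_path = db_path
        self.connection = None

    def open_spider(self, spider):
        # Called by Scrapy when the spider starts: open the connection once.
        self.connection = sqlite3.connect(self.db_path)
        self.connection.execute(
            "CREATE TABLE IF NOT EXISTS products (url TEXT PRIMARY KEY, title TEXT)"
        )

    def close_spider(self, spider):
        # Called by Scrapy when the spider finishes: persist and clean up.
        self.connection.commit()
        self.connection.close()

    def process_item(self, item, spider):
        # Insert the item, replacing any earlier row with the same URL.
        self.connection.execute(
            "INSERT OR REPLACE INTO products (url, title) VALUES (?, ?)",
            (item["url"], item["title"]),
        )
        return item


# Quick check outside Scrapy, using an in-memory database.
pipeline = SQLitePipeline(db_path=":memory:")
pipeline.open_spider(spider=None)
pipeline.process_item(
    {"url": "http://www.exemple.com/p/1", "title": "Example"}, spider=None
)
rows = pipeline.connection.execute("SELECT url, title FROM products").fetchall()
pipeline.close_spider(spider=None)
```

To use it in a project, it would be registered in ITEM_PIPELINES the same way as CustomPipeline. Opening the connection in open_spider rather than in __init__ keeps setup and teardown paired with the spider's lifetime.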

Finally, in your spider, you need to yield your item once it is filled.

spider.py example:

import scrapy
from myproject.items import Product

class MySpider(scrapy.Spider):
    name = "test"
    start_urls = ['http://www.exemple.com']

    def parse(self, response):
        doc = Product()
        doc['url'] = response.url
        # .get() returns the first matching text node as a string (or None)
        doc['title'] = response.xpath('//div/p/text()').get()
        yield doc  # Will go to your pipeline

Hope this helps. For more details on pipelines, see the Item Pipeline section of the Scrapy documentation.

answered Oct 21 '22 08:10

Adrien Blanquer

Tags:

python

scrapy

scrapy-spider

scrapy-pipeline

farhawa
