
Can not get simplest pipeline example to work in scrapy

Tags:

python

scrapy

This is my simplest code and I cannot get it to work.

I am subclassing from InitSpider.

This is my code:

from scrapy.contrib.spiders.init import InitSpider
from scrapy.http import Request

from items import MyItem


class MytestSpider(InitSpider):
    name = 'mytest'
    allowed_domains = ['example.com']
    login_page = 'http://www.example.com'
    start_urls = ["http://www.example.com/ist.php"]

    def init_request(self):
        """This function is called before crawling starts."""
        return Request(url=self.login_page, callback=self.parse)

    def parse(self, response):
        item = MyItem()
        item['username'] = "mytest"
        return item

Pipeline:

class TestPipeline(object):
    def process_item(self, item, spider):
        print item['username']

I get the same error if I try to print the item.

The error I get is:

  File "crawler/pipelines.py", line 35, in process_item
    myitem.username = item['username']
exceptions.TypeError: 'NoneType' object has no attribute '__getitem__'

I think the problem is with InitSpider. My pipelines are not getting the item objects.

items.py

from scrapy.item import Item, Field


class MyItem(Item):
    username = Field()

settings.py

BOT_NAME = 'crawler'

SPIDER_MODULES = ['spiders']
NEWSPIDER_MODULE = 'spiders'


DOWNLOADER_MIDDLEWARES = {

    'scrapy.contrib.downloadermiddleware.cookies.CookiesMiddleware': 700 # <-
}

COOKIES_ENABLED = True
COOKIES_DEBUG = True


ITEM_PIPELINES = [
    'pipelines.TestPipeline',
]

IMAGES_STORE = '/var/www/htmlimages'
asked Dec 15 '12 by user1858027



2 Answers

pipelines.TestPipeline is missing an order number. It should be something like ITEM_PIPELINES = {'pipelines.TestPipeline': 900}.
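As a minimal sketch, assuming the flat project layout from the question (a top-level pipelines.py), the corrected setting would look like this:

```python
# settings.py -- ITEM_PIPELINES maps each pipeline's import path to an
# order number in the 0-1000 range; lower-valued pipelines run first.
ITEM_PIPELINES = {
    'pipelines.TestPipeline': 900,
}
```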

answered Oct 13 '22 by vytotas


There's another issue with your process_item function. According to the official documentation:

This method is called for every item pipeline component and must either return a dict with data, Item (or any descendant class) object or raise a DropItem exception. Dropped items are no longer processed by further pipeline components.

In your case, you could add a return statement at the end of your function:

def process_item(self, item, spider):
    print item['username']
    return item

If you don't include a return statement, the pipeline implicitly returns None. That's why the following pipeline complains: you can't do item['username'] when item is None.
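To see why, here is a minimal sketch (plain Python 3, no Scrapy required) of how the framework chains process_item calls from one pipeline component to the next; the pipeline classes and the run_chain helper are hypothetical names, not Scrapy API:

```python
class BrokenPipeline(object):
    def process_item(self, item, spider):
        print(item['username'])
        # no return statement -> implicitly returns None

class FixedPipeline(object):
    def process_item(self, item, spider):
        print(item['username'])
        return item  # hand the item on to the next pipeline

class NextPipeline(object):
    def process_item(self, item, spider):
        # raises TypeError when item is None
        username = item['username']
        return item

def run_chain(pipelines, item, spider=None):
    # Scrapy feeds each pipeline's return value to the next component
    for pipeline in pipelines:
        item = pipeline.process_item(item, spider)
    return item

try:
    run_chain([BrokenPipeline(), NextPipeline()], {'username': 'mytest'})
except TypeError as exc:
    print('chain broke:', exc)

result = run_chain([FixedPipeline(), NextPipeline()], {'username': 'mytest'})
print(result)  # {'username': 'mytest'}
```

With BrokenPipeline the chain fails exactly as in the question; with FixedPipeline the item survives to the end.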

answered Oct 13 '22 by dxue2012