
Can not get simplest pipeline example to work in scrapy

Tags:

python

scrapy

This is my simplest code and I cannot get it to work.

I am subclassing from InitSpider.

This is my code:

from scrapy.contrib.spiders.init import InitSpider
from scrapy.http import Request

from items import MyItem


class MytestSpider(InitSpider):
    name = 'mytest'
    allowed_domains = ['example.com']
    login_page = 'http://www.example.com'
    start_urls = ["http://www.example.com/ist.php"]

    def init_request(self):
        """This function is called before crawling starts."""
        return Request(url=self.login_page, callback=self.parse)

    def parse(self, response):
        item = MyItem()
        item['username'] = "mytest"
        return item

Pipeline:

class TestPipeline(object):
    def process_item(self, item, spider):
        print item['username']

I get the same error if I try to print the item.

The error I get is:

  File "crawler/pipelines.py", line 35, in process_item
    myitem.username = item['username']
exceptions.TypeError: 'NoneType' object has no attribute '__getitem__'

I think the problem is with InitSpider. My pipelines are not getting the item objects.

items.py

from scrapy.item import Item, Field


class MyItem(Item):
    username = Field()

settings.py

BOT_NAME = 'crawler'

SPIDER_MODULES = ['spiders']
NEWSPIDER_MODULE = 'spiders'


DOWNLOADER_MIDDLEWARES = {

    'scrapy.contrib.downloadermiddleware.cookies.CookiesMiddleware': 700 # <-
}

COOKIES_ENABLED = True
COOKIES_DEBUG = True


ITEM_PIPELINES = [
    'pipelines.TestPipeline',
]

IMAGES_STORE = '/var/www/htmlimages'
asked Dec 15 '12 by user1858027



2 Answers

pipelines.TestPipeline is missing an order number. It should be something like ITEM_PIPELINES = {'pipelines.TestPipeline': 900}.
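As a minimal sketch, assuming the flat project layout from the question (a top-level pipelines.py), the corrected setting would look like this:

```python
# settings.py -- ITEM_PIPELINES maps each pipeline's import path to an
# order number in the 0-1000 range; lower-valued pipelines run first.
ITEM_PIPELINES = {
    'pipelines.TestPipeline': 900,
}
```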

answered Oct 13 '22 by vytotas


There's another issue with your process_item function. According to the official documentation:

This method is called for every item pipeline component and must either return a dict with data, Item (or any descendant class) object or raise a DropItem exception. Dropped items are no longer processed by further pipeline components.

In your case, you could add a return statement at the end of your function:

def process_item(self, item, spider):
    print item['username']
    return item

If you don't include a return statement, the pipeline implicitly returns None. That's why the following pipeline complains: you can't do item['username'] when item is None.
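To see why, here is a minimal sketch (plain Python 3, no Scrapy required) of how the framework chains process_item calls from one pipeline component to the next; the pipeline classes and the run_chain helper are hypothetical names, not Scrapy API:

```python
class BrokenPipeline(object):
    def process_item(self, item, spider):
        print(item['username'])
        # no return statement -> implicitly returns None

class FixedPipeline(object):
    def process_item(self, item, spider):
        print(item['username'])
        return item  # hand the item on to the next pipeline

class NextPipeline(object):
    def process_item(self, item, spider):
        # raises TypeError when item is None
        username = item['username']
        return item

def run_chain(pipelines, item, spider=None):
    # Scrapy feeds each pipeline's return value to the next component
    for pipeline in pipelines:
        item = pipeline.process_item(item, spider)
    return item

try:
    run_chain([BrokenPipeline(), NextPipeline()], {'username': 'mytest'})
except TypeError as exc:
    print('chain broke:', exc)

result = run_chain([FixedPipeline(), NextPipeline()], {'username': 'mytest'})
print(result)  # {'username': 'mytest'}
```

With BrokenPipeline the chain fails exactly as in the question; with FixedPipeline the item survives to the end.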

answered Oct 13 '22 by dxue2012