This is my simple code and I am not getting it to work.
I am subclassing InitSpider.
This is my code:
class MytestSpider(InitSpider):
    name = 'mytest'
    allowed_domains = ['example.com']
    login_page = 'http://www.example.com'
    start_urls = ["http://www.example.com/ist.php"]

    def init_request(self):
        #"""This function is called before crawling starts."""
        return Request(url=self.login_page, callback=self.parse)

    def parse(self, response):
        item = MyItem()
        item['username'] = "mytest"
        return item
class TestPipeline(object):
    def process_item(self, item, spider):
        print item['username']
I get the same error even if I just try to print the item.
The error I get is:
File "crawler/pipelines.py", line 35, in process_item
myitem.username = item['username']
exceptions.TypeError: 'NoneType' object has no attribute '__getitem__'
I think the problem is with InitSpider. My pipelines are not getting item objects.
class MyItem(Item):
    username = Field()
BOT_NAME = 'crawler'
SPIDER_MODULES = ['spiders']
NEWSPIDER_MODULE = 'spiders'
DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.cookies.CookiesMiddleware': 700,  # <-
}
COOKIES_ENABLED = True
COOKIES_DEBUG = True
ITEM_PIPELINES = [
    'pipelines.TestPipeline',
]
IMAGES_STORE = '/var/www/htmlimages'
You activate an Item Pipeline component by adding its class to the ITEM_PIPELINES setting. Each class is assigned an integer value that determines the order in which the pipelines run (lower-valued classes run before higher-valued ones); values are conventionally in the 0-1000 range.
Scrapy is a web scraping framework used to scrape, parse, and collect web data. Scraped data is handled in pipelines.py by components (classes) that are executed sequentially.
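For example, a settings.py with two pipelines (the pipeline names here are hypothetical, just to illustrate the ordering) could look like this:

# Lower-valued pipelines run first: CleanPipeline processes each item
# before StorePipeline sees it.
ITEM_PIPELINES = {
    'pipelines.CleanPipeline': 300,
    'pipelines.StorePipeline': 800,
}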
Each item pipeline component (sometimes referred to as just an "Item Pipeline") is a Python class that implements a simple method. It receives an item, performs an action on it, and decides whether the item should continue through the pipeline or be dropped and no longer processed.
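As a rough sketch (a hypothetical pipeline, not your code), a component that validates items and drops the incomplete ones could look like this:

from scrapy.exceptions import DropItem

class ValidateUsernamePipeline(object):
    # Hypothetical pipeline: keep items that have a username, drop the rest.
    def process_item(self, item, spider):
        if item.get('username'):
            return item  # pass the item on to the next pipeline component
        raise DropItem("Missing username in %s" % item)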
pipelines.TestPipeline is missing an order number. It should be something like ITEM_PIPELINES = {'pipelines.TestPipeline': 900}.
There's another issue with your process_item function. According to the official documentation:
This method is called for every item pipeline component and must either return a dict with data, Item (or any descendant class) object or raise a DropItem exception. Dropped items are no longer processed by further pipeline components.
In your case, you could add a return statement at the end of your function:
def process_item(self, item, spider):
    print item['username']
    return item
If you don't include a return statement, the return value of this pipeline is None. That's why the following pipeline complains: you can't do item['username'] when item is None.
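Putting the two fixes together, the relevant parts of pipelines.py and settings.py would look roughly like this:

# pipelines.py
class TestPipeline(object):
    def process_item(self, item, spider):
        print item['username']
        return item  # hand the item on to the next pipeline component

# settings.py
ITEM_PIPELINES = {
    'pipelines.TestPipeline': 900,
}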