
Dealing with Scrapy Div Class

Tags:

python

scrapy

I am new to Scrapy, and to Python as well. I am trying to write a scraper that extracts the article title, link and description from a web page, almost like an RSS feed, to help with my thesis. I've written the scraper below, but when I run it and export the output as a .txt file, the file comes back blank. I believe I need to add an Item Loader, but I am not positive.

items.py

from scrapy.item import Item, Field

class NorthAfricaItem(Item):
    # One field per piece of data scraped for each article
    title = Field()
    link = Field()
    desc = Field()

Spider

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from northafricatutorial.items import NorthAfricaItem

class NorthAfricaItem(BaseSpider):
    name = "northafrica"
    allowed_domains = ["http://www.north-africa.com/"]
    start_urls = [
        "http://www.north-africa.com/naj_news/news_na/index.1.html",
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//ul/li')
        items = []
        for site in sites:
            item = NorthAfricaItem()
            item['title'] = site.select('a/text()').extract()
            item['link'] = site.select('a/@href').extract()
            item['desc'] = site.select('text()').extract()
            items.append(item)
        return items

UPDATE

Thanks to Talvalin for the help. With that and some messing around, I was able to fix the problem. I had been using a stock script that I found online, but once I used the Scrapy shell I was able to find the correct tags to get what I needed. I've ended up with:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from northafrica.items import NorthAfricaItem

class NorthAfricaSpider(BaseSpider):
    name = "northafrica"
    # allowed_domains takes domain names only, not full URLs
    allowed_domains = ["north-africa.com"]
    start_urls = [
        "http://www.north-africa.com/naj_news/news_na/index.1.html",
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//ul/li')
        items = []
        for site in sites:
            item = NorthAfricaItem()
            # Note: a leading // searches the whole document, not just `site`
            item['title'] = site.select('//div[@class="short_holder"]/h2/a/text()').extract()
            item['link'] = site.select('//div[@class="short_holder"]/h2/a/@href').extract()
            item['desc'] = site.select('//span[@class="summary"]/text()').extract()
            items.append(item)
        return items

If anyone sees anything I did wrong here, let me know, but it works.

Mike asked Nov 12 '22 12:11



1 Answer

The thing to note about this code is that it runs with an error. Try running the spider via the command line and you will see something along the lines of:

        exceptions.TypeError: 'NorthAfricaItem' object does not support item assignment

2013-01-24 16:43:35+0000 [northafrica] INFO: Closing spider (finished)

This error occurs because you've given your spider class and your item class the same name: NorthAfricaItem

In your spider code, when you create an instance of NorthAfricaItem to assign things to (like title, link and desc), the spider version takes precedence over the item version. Since the spider version of NorthAfricaItem is not actually a type of Item, the item assignment fails.
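The shadowing can be reproduced without Scrapy. The sketch below is only an illustration: both classes are hypothetical stand-ins (a dict subclass in place of the real Item, a plain class in place of BaseSpider), defined one after the other so that the second definition rebinds the name, much as the class statement in your spider file rebinds the imported name.

```python
# Hypothetical stand-in for the item class from items.py (not Scrapy's real Item).
class NorthAfricaItem(dict):
    """Dict-like item that supports item['field'] = value."""


# Hypothetical stand-in for the spider. Reusing the name rebinds
# NorthAfricaItem in this module, so the item class is no longer reachable.
class NorthAfricaItem(object):
    name = "northafrica"

    def parse(self):
        item = NorthAfricaItem()   # resolves to the spider class, not the item
        item['title'] = 'x'        # the spider class has no __setitem__
        return item


try:
    NorthAfricaItem().parse()
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)
```

The name lookup inside parse happens at call time, by which point the module-level name already points at the second class.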

To fix the issue, rename your spider class to something like NorthAfricaSpider and the problem should be resolved.
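As a rough check that distinct names are enough, here is a minimal sketch using hypothetical stand-ins rather than the real Scrapy classes (a dict subclass for the item, a plain class for the spider). The title value is made up for illustration.

```python
# Hypothetical stand-in for the item class from items.py (not Scrapy's real Item).
class NorthAfricaItem(dict):
    """Dict-like item that supports item['field'] = value."""


# Hypothetical stand-in for the spider, now under its own name.
class NorthAfricaSpider(object):
    name = "northafrica"

    def parse(self):
        item = NorthAfricaItem()            # resolves to the item class
        item['title'] = ['Example title']   # made-up value; assignment succeeds
        return item


result = NorthAfricaSpider().parse()
print(result)
```

With separate names, the item class is the one instantiated inside parse and the assignment goes through.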

Talvalin answered Nov 15 '22 05:11
