What is the best way to scrape multiple domains with Scrapy?

I have around 10-odd sites that I wish to scrape. A couple of them are WordPress blogs and follow the same HTML structure, albeit with different classes. The others are either forums or blogs of other formats.

The information I'd like to scrape is common: the post content, the timestamp, the author, the title, and the comments.

My question is, do I have to create one separate spider for each domain? If not, how can I create a generic spider that allows me to scrape by loading options from a configuration file or something similar?

I figured I could load the XPath expressions from a file whose location can be passed via the command line, but there seem to be some difficulties: scraping some domains requires a regex, as in select(expression_here).re(regex), while others do not.
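The idea above could look something like this (a sketch of my own, not from the answers; the config keys and domain names are made up, and the selector is stubbed as a plain callable so the regex-or-no-regex logic can be shown without scrapy installed — in a real spider it would be a scrapy Selector and the calls would be .xpath(...).getall() / .re(...)):

```python
import json
import re

# Hypothetical per-domain config: each field maps to an XPath plus an
# optional regex that is applied to the extracted text only when present.
CONFIG = json.loads(r"""
{
  "someblog.com": {
    "title":     {"xpath": "//h1/text()"},
    "timestamp": {"xpath": "//span[@class='date']/text()",
                  "regex": "\\d{4}-\\d{2}-\\d{2}"}
  }
}
""")

def extract_field(selector, field_cfg):
    """Run the XPath, then the regex if the config defines one.

    `selector` is stubbed here as a callable mapping an XPath string to
    a list of strings; swap in a scrapy Selector in a real spider.
    """
    values = selector(field_cfg["xpath"])
    regex = field_cfg.get("regex")
    if regex:
        pattern = re.compile(regex)
        values = [m.group(0) for v in values
                  for m in [pattern.search(v)] if m]
    return values[0] if values else None
```

This way a single spider can read one config file per domain and never needs to know in code which fields want a regex.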

asked Mar 31 '11 by goh

2 Answers

In your Scrapy spider, set allowed_domains to a list of domains, for example:

class YourSpider(CrawlSpider):
    allowed_domains = ['domain1.com', 'domain2.com']

Hope it helps.
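Note that allowed_domains by itself only keeps the crawl from wandering off-site; it does not change how each site is parsed. If each domain needs different extraction logic, one option (my own extension of this answer, with made-up parser names) is to dispatch on the host of the URL being parsed:

```python
from urllib.parse import urlparse

# Hypothetical per-domain parse callbacks; in a real CrawlSpider these
# would take a Response and yield items. Here they take a URL so the
# dispatch logic can be shown on its own.
def parse_wordpress(url):
    return {"parser": "wordpress", "url": url}

def parse_forum(url):
    return {"parser": "forum", "url": url}

PARSERS = {
    "domain1.com": parse_wordpress,
    "domain2.com": parse_forum,
}

def dispatch(url):
    """Pick the parser registered for a URL's domain (www. stripped)."""
    host = urlparse(url).netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    return PARSERS[host](url)
```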

answered Oct 05 '22 by llazzaro


Well, I faced the same issue, so I created the spider class dynamically using type():

from scrapy.spiders import CrawlSpider   # 'scrapy.contrib.spiders' in old Scrapy versions
from urllib.parse import urlparse        # 'import urlparse' on Python 2

class GenericSpider(CrawlSpider):
    """A generic spider; uses type() to make a new spider class for each domain."""
    name = 'generic'
    allowed_domains = []
    start_urls = []

    @classmethod
    def create(cls, link):
        domain = urlparse(link).netloc.lower()
        # generate a class name such that the domain www.google.com
        # results in the class name GoogleComGenericSpider
        class_name = (domain[4:] if domain.startswith('www.') else domain).title().replace('.', '') + cls.__name__
        return type(class_name, (cls,), {
            'allowed_domains': [domain],
            'start_urls': [link],
            'name': domain
        })

So, say, to create a spider for 'http://www.google.com' I'll just do:

In [3]: google_spider = GenericSpider.create('http://www.google.com')

In [4]: google_spider
Out[4]: __main__.GoogleComGenericSpider

In [5]: google_spider.name
Out[5]: 'www.google.com'

Hope this helps
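This pattern also combines nicely with the config-file idea from the question: pass per-domain selectors into create() so each generated class carries its own XPaths. The sketch below is my own variation (the selectors argument and GenericSpiderBase stand-in are hypothetical; the stand-in base exists only so the mechanics run without scrapy — subclass scrapy's CrawlSpider in a real project):

```python
from urllib.parse import urlparse

class GenericSpiderBase:
    """Stand-in for CrawlSpider so the type() pattern runs anywhere."""
    name = 'generic'
    allowed_domains = []
    start_urls = []
    selectors = {}

    @classmethod
    def create(cls, link, selectors=None):
        domain = urlparse(link).netloc.lower()
        stripped = domain[4:] if domain.startswith('www.') else domain
        class_name = stripped.title().replace('.', '') + cls.__name__
        return type(class_name, (cls,), {
            'allowed_domains': [domain],
            'start_urls': [link],
            'name': domain,
            # hypothetical: the per-domain XPath config travels with the class
            'selectors': selectors or {},
        })

spider_cls = GenericSpiderBase.create(
    'http://www.google.com',
    selectors={'title': '//h1/text()'},
)
```

Each generated class then knows both where to start and how to extract, so one loop over (link, config) pairs can build every spider you need.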

answered Oct 05 '22 by Optimus