Scrapyd: where do I see the output of my crawler once I schedule it using scrapyd?

I am new to scrapy and scrapyd. I did some reading and developed a crawler which crawls a news website and gives me all the news articles from it. If I simply run the crawler with

scrapy crawl dmoz -o something.txt

it correctly writes all the scraped data to something.txt.

I then deployed my scrapy crawler project to scrapyd on localhost:6800.
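For reference, I deployed with a [deploy] target in scrapy.cfg along these lines (the exact target below is just an illustration of my setup):

[settings]
default = tutorial.settings

[deploy]
url = http://localhost:6800/
project = tutorial

and then ran scrapy deploy (or scrapyd-deploy from scrapyd-client) from the project directory.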

I scheduled the crawler with

curl http://localhost:6800/schedule.json -d project=tutorial -d spider=dmoz

which gives me this on the command line:

{"status": "ok", "jobid": "545dfcf092de11e3ad8b0013d43164b8"}

I think that is correct, and I can even see my crawler listed as a job in the web UI at localhost:6800.
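I can also query the job status through scrapyd's listjobs.json endpoint (assuming, as above, that the project is named tutorial):

curl "http://localhost:6800/listjobs.json?project=tutorial"

which lists the pending, running, and finished jobs for the project.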

But where do I find the data scraped by my crawler, the data I previously collected in something.txt?

Please help.

This is my crawler code:

import urlparse

from scrapy.spider import Spider
from scrapy.selector import Selector
from scrapy.http import Request

from tutorial.items import DmozItem  # assuming DmozItem is defined in tutorial/items.py


class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["timesofindia.com"]
    start_urls = ["http://mobiletoi.timesofindia.com/htmldbtoi/TOIPU/20140206/TOIPU_articles__20140206.html"]

    def parse(self, response):
        sel = Selector(response)
        # Emit one item per article title found on the index page.
        for ti in sel.xpath("//a[@class='pda']/text()").extract():
            yield DmozItem(title=ti)
        # Follow every article link, yielding the raw link as an item too.
        for url in sel.xpath("//a[@class='pda']/@href").extract():
            itemLink = urlparse.urljoin(response.url, url)
            yield DmozItem(link=url)
            yield Request(itemLink, callback=self.my_parse)

    def my_parse(self, response):
        sel = Selector(response)
        self.log('A response from my_parse just arrived!')
        for head in sel.xpath("//b[@class='pda']/text()").extract():
            yield DmozItem(heading=head)
        for text in sel.xpath("//a[@class='pda']/text()").extract():
            yield DmozItem(desc=text)
        for url_desc in sel.xpath("//a[@class='pda']/@href").extract():
            itemLinkDesc = urlparse.urljoin(response.url, url_desc)
            yield DmozItem(link=url_desc)
            yield Request(itemLinkDesc, callback=self.my_parse_desc)

    def my_parse_desc(self, response):
        sel = Selector(response)
        self.log('ENTERED ITERATION OF MY_PARSE_DESC!')
        for bo in sel.xpath("//font[@class='pda']/text()").extract():
            yield DmozItem(body=bo)
asked Feb 11 '14 by Yogesh D

1 Answer

When using the feed exports, you define where to store the feed using a URI (through the FEED_URI setting). The feed exports support multiple storage backend types, which are selected by the URI scheme. You can pass the setting to scrapyd when scheduling the job:

curl http://localhost:6800/schedule.json -d project=tutorial -d spider=dmoz -d setting=FEED_URI=file:///path/to/output.json
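Alternatively, you can put the feed settings in the project's settings.py so that every scheduled run exports a feed without passing the setting each time. A minimal sketch, assuming you want one JSON file per run (the path is only an example; %(name)s and %(time)s are placeholders that scrapy fills in with the spider name and run timestamp):

# tutorial/settings.py
FEED_URI = 'file:///tmp/scrapyd-feeds/%(name)s-%(time)s.json'
FEED_FORMAT = 'json'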
answered by kev