Following the documentation, I can run Scrapy from a Python script, but I can't get the result back.
This is my spider:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from items import DmozItem

class DmozSpider(BaseSpider):
    name = "douban"
    allowed_domains = ["example.com"]
    start_urls = [
        "http://www.example.com/group/xxx/discussion"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        rows = hxs.select("//table[@class='olt']/tr/td[@class='title']/a")
        items = []
        # print sites
        for row in rows:
            item = DmozItem()
            item["title"] = row.select('text()').extract()[0]
            item["link"] = row.select('@href').extract()[0]
            items.append(item)
        return items
Notice the last line: I try to use the result returned from parse. If I run:
scrapy crawl douban
the terminal prints the returned items. But I can't get them from the Python script. Here is my script:
from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy.settings import Settings
from scrapy import log, signals
from spiders.dmoz_spider import DmozSpider
from scrapy.xlib.pydispatch import dispatcher

def stop_reactor():
    reactor.stop()

dispatcher.connect(stop_reactor, signal=signals.spider_closed)
spider = DmozSpider(domain='www.douban.com')
crawler = Crawler(Settings())
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
log.msg("------------>Running reactor")
result = reactor.run()
print result
log.msg("------------>Running stopped")
I try to get the result from reactor.run(), but it returns nothing. How can I get the result?
Basic script

The key to running Scrapy in a Python script is the CrawlerProcess class. It lives in Scrapy's crawler module and provides the engine to run Scrapy within a Python script; internally it drives Python's Twisted framework. Remember that Scrapy is built on top of the Twisted asynchronous networking library, so you need to run it inside the Twisted reactor.

Using the scrapy tool

You can start by running the Scrapy tool with no arguments; it prints some usage help and the available commands:

Scrapy X.Y - no active project

Usage:
  scrapy <command> [options] [args]

Available commands:
  crawl         Run a spider
  fetch         Fetch a URL using the Scrapy downloader
  [...]

We use the CrawlerProcess class to run multiple Scrapy spiders in a single process simultaneously: create an instance of CrawlerProcess with the project settings. If a spider needs custom settings, create a dedicated Crawler instance for it.
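A minimal sketch of that approach, assuming a newer Scrapy release (1.0+) where scrapy.crawler.CrawlerProcess is available; the DmozSpider import mirrors the question's project layout:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from spiders.dmoz_spider import DmozSpider

# CrawlerProcess starts and stops the Twisted reactor itself,
# so there is no manual reactor.run() / reactor.stop() to manage.
process = CrawlerProcess(get_project_settings())
process.crawl(DmozSpider)  # queue the spider; crawl() accepts a spider class
process.start()            # blocks here until the crawl is finished

Note that this runs the spider, but by itself it still does not hand you the scraped items; the answers below show how to collect them.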
The terminal prints the result because the default log level is set to DEBUG. When you run your spider from the script and call log.start(), the default log level is set to INFO.
Just replace:
log.start()
with
log.start(loglevel=log.DEBUG)
UPD:
To get the result as a string, you can log everything to a file and then read it back, e.g.:
log.start(logfile="results.log", loglevel=log.DEBUG, crawler=crawler, logstdout=False)
reactor.run()
with open("results.log", "r") as f:
    result = f.read()
print result
Hope that helps.
I found your question while asking myself the same thing, namely: "How can I get the result?" Since this wasn't answered here, I endeavoured to find the answer myself, and now that I have, I can share it:
items = []

def add_item(item):
    items.append(item)

dispatcher.connect(add_item, signal=signals.item_passed)
Or, for Scrapy 0.22 (http://doc.scrapy.org/en/latest/topics/practices.html#run-scrapy-from-a-script), replace the last line of my solution with:
crawler.signals.connect(add_item, signals.item_passed)
My solution is freely adapted from http://www.tryolabs.com/Blog/2011/09/27/calling-scrapy-python-script/.
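For completeness, here is how that collector slots into the question's script; this is an untested sketch against the same old Scrapy 0.x API (BaseSpider, scrapy.xlib.pydispatch) the question uses:

from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy.settings import Settings
from scrapy import signals
from scrapy.xlib.pydispatch import dispatcher

from spiders.dmoz_spider import DmozSpider

items = []

def add_item(item):
    # fires once for every item the spider returns
    items.append(item)

def stop_reactor():
    reactor.stop()

dispatcher.connect(add_item, signal=signals.item_passed)
dispatcher.connect(stop_reactor, signal=signals.spider_closed)

spider = DmozSpider(domain='www.douban.com')
crawler = Crawler(Settings())
crawler.configure()
crawler.crawl(spider)
crawler.start()
reactor.run()  # blocks until stop_reactor() runs on spider_closed

print items    # the DmozItem objects collected by add_item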