This might be a sub-question of Passing arguments to process.crawl in Scrapy python, but the author of that question accepted an answer which doesn't address the part I'm asking about here.
Here's my problem: I cannot use scrapy crawl mySpider -a start_urls=myUrl -o myData.json
Instead I want/need to use crawlerProcess.crawl(spider).
I have already figured out several ways to pass the arguments (and anyway that part is answered in the question I linked), but I can't grasp how I am supposed to tell it to dump the data into myData.json... the -o myData.json part.
Anyone got a suggestion? Or am I just not understanding how it is supposed to work?
Here is the code:
crawlerProcess = CrawlerProcess(settings)
crawlerProcess.install()
crawlerProcess.configure()
spider = challenges(start_urls=["http://www.myUrl.html"])
crawlerProcess.crawl(spider)
# For now I am just trying to get this bit of code to work, but obviously it will become a loop later.
dispatcher.connect(handleSpiderIdle, signals.spider_idle)
log.start()
print "Starting crawler."
crawlerProcess.start()
print "Crawler stopped."
Scrapy is an application framework for writing web spiders that crawl web sites and extract data from them. Scrapy provides a built-in mechanism for extracting data (called selectors) but you can easily use BeautifulSoup (or lxml) instead, if you feel more comfortable working with them.
The first and simplest way to export the data you have scraped is to define an output path when starting your spider from the command line. To save to a CSV (or JSON) file, add the -o flag to the scrapy crawl command along with the path of the file you want to save to; Scrapy picks the export format from the file extension.
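For example, using the spider and file names from the question, either of these would work from the command line:

scrapy crawl mySpider -o myData.json
scrapy crawl mySpider -o myData.csv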
You need to specify it in the settings:
process = CrawlerProcess({
    'FEED_URI': 'file:///tmp/export.json',
})
process.crawl(MySpider)
process.start()
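Adapted to your case, it would look roughly like this. This is only a sketch, not a definitive implementation: the spider class challenges, the URL and myData.json are taken from your question, FEED_FORMAT is added because the default feed format is JSON lines rather than a JSON array, and the crawl(spider_class, **kwargs) form assumes Scrapy 1.0 or newer (older versions take an already-constructed spider instance instead).

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess({
    'FEED_URI': 'myData.json',   # where the scraped items are written (relative paths work)
    'FEED_FORMAT': 'json',       # export a JSON array instead of the default JSON lines
})
# In Scrapy 1.0+ crawl() takes the spider class plus its constructor arguments
process.crawl(challenges, start_urls=["http://www.myUrl.html"])
process.start()  # blocks until the crawl is finished

Note that recent Scrapy releases (2.1+) replace FEED_URI / FEED_FORMAT with the FEEDS setting, but the idea stays the same: the -o behaviour is configured through settings when you run a crawl from a script.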