This might be a sub-question of Passing arguments to process.crawl in Scrapy python, but the author of that question accepted an answer which doesn't address the part I'm asking about here.
Here's my problem: I cannot use scrapy crawl mySpider -a start_urls=myUrl -o myData.json
Instead I want/need to use crawlerProcess.crawl(spider).
I have already figured out several ways to pass the arguments (and anyway that part is answered in the question I linked), but I can't grasp how I am supposed to tell it to dump the data into myData.json... the -o myData.json part.
Anyone got a suggestion? Or am I just not understanding how it is supposed to work?
Here is the code:
crawlerProcess = CrawlerProcess(settings)
crawlerProcess.install()
crawlerProcess.configure()
spider = challenges(start_urls=["http://www.myUrl.html"])
crawlerProcess.crawl(spider)
# For now I am just trying to get this bit of code to work, but obviously it will become a loop later.
dispatcher.connect(handleSpiderIdle, signals.spider_idle)
log.start()
print "Starting crawler."
crawlerProcess.start()
print "Crawler stopped."
Scrapy is an application framework for writing web spiders that crawl web sites and extract data from them. Scrapy provides a built-in mechanism for extracting data (called selectors) but you can easily use BeautifulSoup (or lxml) instead, if you feel more comfortable working with them.
The first and simplest way to export the data you have scraped is to define an output path when starting your spider from the command line. To save to a CSV (or JSON) file, add the -o flag to the scrapy crawl command along with the path of the file you want to save to; Scrapy picks the export format from the file extension.
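For example, using the spider and file names from the question, either of these would work from the command line:

scrapy crawl mySpider -o myData.json
scrapy crawl mySpider -o myData.csv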
You need to specify it in the settings:
process = CrawlerProcess({
    'FEED_URI': 'file:///tmp/export.json',
})
process.crawl(MySpider)
process.start()
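Adapted to your case, it would look roughly like this. This is only a sketch, not a definitive implementation: the spider class challenges, the URL and myData.json are taken from your question, FEED_FORMAT is added because the default feed format is JSON lines rather than a JSON array, and the crawl(spider_class, **kwargs) form assumes Scrapy 1.0 or newer (older versions take an already-constructed spider instance instead).

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess({
    'FEED_URI': 'myData.json',   # where the scraped items are written (relative paths work)
    'FEED_FORMAT': 'json',       # export a JSON array instead of the default JSON lines
})
# In Scrapy 1.0+ crawl() takes the spider class plus its constructor arguments
process.crawl(challenges, start_urls=["http://www.myUrl.html"])
process.start()  # blocks until the crawl is finished

Note that recent Scrapy releases (2.1+) replace FEED_URI / FEED_FORMAT with the FEEDS setting, but the idea stays the same: the -o behaviour is configured through settings when you run a crawl from a script.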