I would like to get a list of scraped items in my main script instead of using the scrapy shell.
I know that the FooSpider class I define has a parse method, and that this method returns a list of Item objects. The Scrapy framework calls this method for me, but how can I get the returned list myself?
I have found many posts about this, but I don't understand what they are saying.
For context, here is the official example code:
import scrapy
from tutorial.items import DmozItem


class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/",
    ]

    def parse(self, response):
        for href in response.css("ul.directory.dir-col > li > a::attr('href')"):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_dir_contents)

    def parse_dir_contents(self, response):
        result = []
        for sel in response.xpath('//ul/li'):
            item = DmozItem()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['desc'] = sel.xpath('text()').extract()
            result.append(item)
        return result
How can I get the returned result from a main Python script such as main.py or run.py?
if __name__ == "__main__":
    ...
    result = xxxx()
    for item in result:
        print(item)
Could anyone provide a code snippet that shows how to get this returned list?
Thank you very much!
This is an example of how you can collect all items in a list with a pipeline:
#!/usr/bin/python3

# Scrapy API imports
import scrapy
from scrapy.crawler import CrawlerProcess

# your spider
from FollowAllSpider import FollowAllSpider

# list to collect all items
items = []


# pipeline to fill the items list
class ItemCollectorPipeline(object):
    def process_item(self, item, spider):
        items.append(item)
        # return the item so any later pipelines still receive it
        return item


# create a crawler process with the specified settings;
# ITEM_PIPELINES maps a pipeline class path to a priority, and
# '__main__.ItemCollectorPipeline' works because the class is
# defined in this same script
process = CrawlerProcess({
    'USER_AGENT': 'scrapy',
    'LOG_LEVEL': 'INFO',
    'ITEM_PIPELINES': {'__main__.ItemCollectorPipeline': 100},
})

# start the spider; process.start() blocks until the crawl is finished
process.crawl(FollowAllSpider)
process.start()

# print the collected items
for item in items:
    print("url: " + item['url'])
You can get FollowAllSpider
from here, or use your own spider. Example output when using it with my webpage:
$ ./crawler.py
2018-09-16 22:28:09 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: scrapybot)
2018-09-16 22:28:09 [scrapy.utils.log] INFO: Versions: lxml 3.7.1.0, libxml2 2.9.4, cssselect 1.0.3, parsel 1.5.0, w3lib 1.19.0, Twisted 18.7.0, Python 3.5.3 (default, Jan 19 2017, 14:11:04) - [GCC 6.3.0 20170118], pyOpenSSL 16.2.0 (OpenSSL 1.1.0f 25 May 2017), cryptography 1.7.1, Platform Linux-4.9.0-6-amd64-x86_64-with-debian-9.5
2018-09-16 22:28:09 [scrapy.crawler] INFO: Overridden settings: {'USER_AGENT': 'scrapy', 'LOG_LEVEL': 'INFO'}
[...]
2018-09-16 22:28:15 [scrapy.core.engine] INFO: Spider closed (finished)
url: http://www.frank-buss.de/
url: http://www.frank-buss.de/impressum.html
url: http://www.frank-buss.de/spline.html
url: http://www.frank-buss.de/schnecke/index.html
url: http://www.frank-buss.de/solitaire/index.html
url: http://www.frank-buss.de/forth/index.html
url: http://www.frank-buss.de/pi.tex
[...]
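If the linked spider is not available to you, here is a rough sketch of a compatible stand-in. This is my own assumption of what such a spider could look like, not the original FollowAllSpider; the class name and domain are placeholders chosen to match the output above. It yields one dict item per visited page so the pipeline's item['url'] lookup works:

# hypothetical stand-in for FollowAllSpider: visit a page, record its URL,
# and follow every in-domain link
import scrapy
from scrapy.linkextractors import LinkExtractor


class FollowAllSpider(scrapy.Spider):
    name = "follow_all"
    # assumed start page; replace with the site you want to crawl
    start_urls = ["http://www.frank-buss.de/"]

    def parse(self, response):
        # yield one item per visited page so the pipeline can collect its URL
        yield {"url": response.url}
        # follow every link that stays on the same domain
        for link in LinkExtractor(allow_domains=["frank-buss.de"]).extract_links(response):
            yield scrapy.Request(link.url, callback=self.parse)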
If what you want is to work with, process, transform, or store the items, you should look into the Item Pipeline docs, and the usual scrapy crawl command would do the trick.
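For example, a pipeline that stores every scraped item could look roughly like the sketch below. The class name and output file are made up for illustration; you would enable it by adding it to ITEM_PIPELINES in your project's settings.py:

# hypothetical pipeline that appends each item to items.jl,
# one JSON object per line
import json


class JsonLinesStoragePipeline(object):
    def open_spider(self, spider):
        # called once when the spider starts: open the output file
        self.file = open("items.jl", "w")

    def close_spider(self, spider):
        # called once when the spider finishes: close the file
        self.file.close()

    def process_item(self, item, spider):
        # serialize the item, write it out, and pass it on unchanged
        self.file.write(json.dumps(dict(item)) + "\n")
        return item

If you only need a plain export to a file, Scrapy's feed exports already do this without a custom pipeline, e.g. scrapy crawl dmoz -o items.json.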