
How to get scraped items from the main script using Scrapy?

Tags: python, scrapy

I would like to get a list of scraped items in my main script, instead of using the scrapy shell.

I know there is a parse method in the FooSpider class I define, and that this method returns a list of items. The Scrapy framework calls this method for me. But how can I get hold of that returned list myself?

I have found many posts about this, but I don't understand what they are saying.

For context, here is the official example code:

import scrapy

from tutorial.items import DmozItem

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/",
    ]

    def parse(self, response):
        # follow each category link and parse its contents
        for href in response.css("ul.directory.dir-col > li > a::attr('href')"):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_dir_contents)

    def parse_dir_contents(self, response):
        # build one DmozItem per listed link and return them as a list
        result = []
        for sel in response.xpath('//ul/li'):
            item = DmozItem()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['desc'] = sel.xpath('text()').extract()
            result.append(item)

        return result

How can I get the returned result from a main Python script such as main.py or run.py?

if __name__ == "__main__":
    ...
    result = xxxx()  # somehow run the spider and get the scraped items back
    for item in result:
        print(item)

Could anyone provide a code snippet that gets hold of this returned list?

Thank you very much!

asked Jul 04 '16 by KyL


2 Answers

Here is an example of how you can collect all the items in a list with a pipeline:

#!/usr/bin/python3

# Scrapy API imports
import scrapy
from scrapy.crawler import CrawlerProcess

# your spider
from FollowAllSpider import FollowAllSpider

# list to collect all items
items = []

# pipeline to fill the items list
class ItemCollectorPipeline(object):
    def process_item(self, item, spider):
        items.append(item)
        return item

# create a crawler process with the specified settings
process = CrawlerProcess({
    'USER_AGENT': 'scrapy',
    'LOG_LEVEL': 'INFO',
    'ITEM_PIPELINES': { '__main__.ItemCollectorPipeline': 100 }
})

# start the spider
process.crawl(FollowAllSpider)
process.start()

# print the items
for item in items:
    print("url: " + item['url'])

You can get FollowAllSpider from here, or use your own spider (a rough sketch of what such a spider could look like follows after the output below). Example output when using it with my webpage:

$ ./crawler.py 
2018-09-16 22:28:09 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: scrapybot)
2018-09-16 22:28:09 [scrapy.utils.log] INFO: Versions: lxml 3.7.1.0, libxml2 2.9.4, cssselect 1.0.3, parsel 1.5.0, w3lib 1.19.0, Twisted 18.7.0, Python 3.5.3 (default, Jan 19 2017, 14:11:04) - [GCC 6.3.0 20170118], pyOpenSSL 16.2.0 (OpenSSL 1.1.0f  25 May 2017), cryptography 1.7.1, Platform Linux-4.9.0-6-amd64-x86_64-with-debian-9.5
2018-09-16 22:28:09 [scrapy.crawler] INFO: Overridden settings: {'USER_AGENT': 'scrapy', 'LOG_LEVEL': 'INFO'}
[...]
2018-09-16 22:28:15 [scrapy.core.engine] INFO: Spider closed (finished)
url: http://www.frank-buss.de/
url: http://www.frank-buss.de/impressum.html
url: http://www.frank-buss.de/spline.html
url: http://www.frank-buss.de/schnecke/index.html
url: http://www.frank-buss.de/solitaire/index.html
url: http://www.frank-buss.de/forth/index.html
url: http://www.frank-buss.de/pi.tex
[...]
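
For reference, here is a minimal sketch of what a spider along the lines of FollowAllSpider could look like. This is an assumption, not the original spider from the link above: the class name, the start URL, the allowed domain, and the single 'url' field are placeholders; the printing loop in the script only requires that each yielded item has a 'url' field.

import scrapy

class FollowAllSpider(scrapy.Spider):
    # hypothetical stand-in for the linked FollowAllSpider
    name = "followall"
    allowed_domains = ["frank-buss.de"]          # assumed domain
    start_urls = ["http://www.frank-buss.de/"]   # assumed start page

    def parse(self, response):
        # yield one item per visited page, then follow every link on it
        yield {"url": response.url}
        for href in response.css("a::attr(href)").extract():
            yield response.follow(href, callback=self.parse)

Any spider works with the pipeline above, as long as the final print loop is adjusted to the fields it actually yields.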
answered Sep 28 '22 by Frank Buss


If what you want is to work with, process, transform, or store the items, you should look into the Item Pipeline, and the usual scrapy crawl command will do the trick. A minimal sketch of such a pipeline is shown below.
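
As a minimal sketch, modeled on the JsonWriterPipeline example from the Scrapy documentation (the file name items.jl and the dotted path myproject.pipelines.JsonWriterPipeline are placeholders for your own project layout):

# pipelines.py -- store every scraped item as one JSON line
import json

class JsonWriterPipeline(object):
    def open_spider(self, spider):
        # called once when the spider starts
        self.file = open('items.jl', 'w')

    def close_spider(self, spider):
        # called once when the spider finishes
        self.file.close()

    def process_item(self, item, spider):
        # serialize the item and keep passing it down the pipeline
        self.file.write(json.dumps(dict(item)) + "\n")
        return item

Enable it in settings.py and run scrapy crawl dmoz as usual; the items then end up in items.jl:

# settings.py -- register the pipeline
ITEM_PIPELINES = {
    'myproject.pipelines.JsonWriterPipeline': 300,
}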

answered Sep 28 '22 by Wilfredo