
How to get scraped items from the main script using Scrapy?

Tags: python, scrapy

I would like to get a list of scraped items in my main script, instead of using the scrapy shell.

I know there is a parse method in the FooSpider class I define, and that this method returns a list of items. The Scrapy framework calls this method for me. But how can I get hold of that returned list myself?

I have found many posts about this, but I don't understand what they are saying.

For context, here is the official example code:

import scrapy

from tutorial.items import DmozItem

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/",
    ]

    def parse(self, response):
        # follow each category link and parse its contents
        for href in response.css("ul.directory.dir-col > li > a::attr('href')"):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_dir_contents)

    def parse_dir_contents(self, response):
        # build one DmozItem per listed link and return them as a list
        result = []
        for sel in response.xpath('//ul/li'):
            item = DmozItem()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['desc'] = sel.xpath('text()').extract()
            result.append(item)

        return result

How can I get the returned result from a main Python script such as main.py or run.py?

if __name__ == "__main__":
    ...
    result = xxxx()  # somehow run the spider and get the scraped items back
    for item in result:
        print(item)

Could anyone provide a code snippet that gets hold of this returned list?

Thank you very much!

asked Jul 04 '16 by KyL


2 Answers

Here is an example of how you can collect all the items in a list with a pipeline:

#!/usr/bin/python3

# Scrapy API imports
import scrapy
from scrapy.crawler import CrawlerProcess

# your spider
from FollowAllSpider import FollowAllSpider

# list to collect all items
items = []

# pipeline to fill the items list
class ItemCollectorPipeline(object):
    def process_item(self, item, spider):
        items.append(item)
        return item

# create a crawler process with the specified settings
process = CrawlerProcess({
    'USER_AGENT': 'scrapy',
    'LOG_LEVEL': 'INFO',
    'ITEM_PIPELINES': { '__main__.ItemCollectorPipeline': 100 }
})

# start the spider
process.crawl(FollowAllSpider)
process.start()

# print the items
for item in items:
    print("url: " + item['url'])

You can get FollowAllSpider from here, or use your own spider (a rough sketch of what such a spider could look like follows after the output below). Example output when using it with my webpage:

$ ./crawler.py 
2018-09-16 22:28:09 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: scrapybot)
2018-09-16 22:28:09 [scrapy.utils.log] INFO: Versions: lxml 3.7.1.0, libxml2 2.9.4, cssselect 1.0.3, parsel 1.5.0, w3lib 1.19.0, Twisted 18.7.0, Python 3.5.3 (default, Jan 19 2017, 14:11:04) - [GCC 6.3.0 20170118], pyOpenSSL 16.2.0 (OpenSSL 1.1.0f  25 May 2017), cryptography 1.7.1, Platform Linux-4.9.0-6-amd64-x86_64-with-debian-9.5
2018-09-16 22:28:09 [scrapy.crawler] INFO: Overridden settings: {'USER_AGENT': 'scrapy', 'LOG_LEVEL': 'INFO'}
[...]
2018-09-16 22:28:15 [scrapy.core.engine] INFO: Spider closed (finished)
url: http://www.frank-buss.de/
url: http://www.frank-buss.de/impressum.html
url: http://www.frank-buss.de/spline.html
url: http://www.frank-buss.de/schnecke/index.html
url: http://www.frank-buss.de/solitaire/index.html
url: http://www.frank-buss.de/forth/index.html
url: http://www.frank-buss.de/pi.tex
[...]
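
For reference, here is a minimal sketch of what a spider along the lines of FollowAllSpider could look like. This is an assumption, not the original spider from the link above: the class name, the start URL, the allowed domain, and the single 'url' field are placeholders; the printing loop in the script only requires that each yielded item has a 'url' field.

import scrapy

class FollowAllSpider(scrapy.Spider):
    # hypothetical stand-in for the linked FollowAllSpider
    name = "followall"
    allowed_domains = ["frank-buss.de"]          # assumed domain
    start_urls = ["http://www.frank-buss.de/"]   # assumed start page

    def parse(self, response):
        # yield one item per visited page, then follow every link on it
        yield {"url": response.url}
        for href in response.css("a::attr(href)").extract():
            yield response.follow(href, callback=self.parse)

Any spider works with the pipeline above, as long as the final print loop is adjusted to the fields it actually yields.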
answered Sep 28 '22 by Frank Buss


If what you want is to work with, process, transform, or store the items, you should look into the Item Pipeline, and the usual scrapy crawl command will do the trick. A minimal sketch of such a pipeline is shown below.
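
As a minimal sketch, modeled on the JsonWriterPipeline example from the Scrapy documentation (the file name items.jl and the dotted path myproject.pipelines.JsonWriterPipeline are placeholders for your own project layout):

# pipelines.py -- store every scraped item as one JSON line
import json

class JsonWriterPipeline(object):
    def open_spider(self, spider):
        # called once when the spider starts
        self.file = open('items.jl', 'w')

    def close_spider(self, spider):
        # called once when the spider finishes
        self.file.close()

    def process_item(self, item, spider):
        # serialize the item and keep passing it down the pipeline
        self.file.write(json.dumps(dict(item)) + "\n")
        return item

Enable it in settings.py and run scrapy crawl dmoz as usual; the items then end up in items.jl:

# settings.py -- register the pipeline
ITEM_PIPELINES = {
    'myproject.pipelines.JsonWriterPipeline': 300,
}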

answered Sep 28 '22 by Wilfredo