Speed up web scraper

Tags: performance, python, web-scraping, scrapy, scrapy-spider

I am scraping 23770 webpages with a pretty simple web scraper built with Scrapy. I am quite new to Scrapy and even to Python, but I managed to write a spider that does the job. It is, however, really slow: it takes approx. 28 hours to crawl the 23770 pages.

I have looked at the Scrapy website, the mailing lists, and Stack Overflow, but I can't seem to find generic recommendations for writing fast crawlers that a beginner can understand. Maybe my problem is not the spider itself, but the way I run it. All suggestions welcome!

I have listed my code below, if it's needed.

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item, Field
import re

class Sale(Item):
    Adresse = Field()
    Pris = Field()
    Salgsdato = Field()
    SalgsType = Field()
    KvmPris = Field()
    Rum = Field()
    Postnummer = Field()
    Boligtype = Field()
    Kvm = Field()
    Bygget = Field()

class HouseSpider(BaseSpider):
    name = 'House'
    allowed_domains = ["http://boliga.dk/"]
    start_urls = ['http://www.boliga.dk/salg/resultater?so=1&type=Villa&type=Ejerlejlighed&type=R%%C3%%A6kkehus&kom=&amt=&fraPostnr=&tilPostnr=&iPostnr=&gade=&min=&max=&byggetMin=&byggetMax=&minRooms=&maxRooms=&minSize=&maxSize=&minsaledate=1992&maxsaledate=today&kode=&p=%d' %n for n in xrange(1, 23770, 1)]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select("id('searchresult')/tr")
        items = []      
        for site in sites:
            item = Sale()
            item['Adresse'] = site.select("td[1]/a[1]/text()").extract()
            item['Pris'] = site.select("td[2]/text()").extract()
            item['Salgsdato'] = site.select("td[3]/text()").extract()
            Temp = site.select("td[4]/text()").extract()
            Temp = Temp[0]
            m = re.search('\r\n\t\t\t\t\t(.+?)\r\n\t\t\t\t', Temp)
            if m:
                found = m.group(1)
                item['SalgsType'] = found
            else:
                item['SalgsType'] = Temp
            item['KvmPris'] = site.select("td[5]/text()").extract()
            item['Rum'] = site.select("td[6]/text()").extract()
            item['Postnummer'] = site.select("td[7]/text()").extract()
            item['Boligtype'] = site.select("td[8]/text()").extract()
            item['Kvm'] = site.select("td[9]/text()").extract()
            item['Bygget'] = site.select("td[10]/text()").extract()
            items.append(item)
        return items

Thanks!

asked Jun 10 '13 17:06

Mace

2 Answers

Here's a collection of things to try:

- Use the latest Scrapy version (if you are not using it already).
- Check whether any non-standard middlewares are in use.
- Try increasing the CONCURRENT_REQUESTS_PER_DOMAIN and CONCURRENT_REQUESTS settings (docs).
- Turn off logging with LOG_ENABLED = False (docs).
- Try yielding each item inside the loop instead of collecting items into the items list and returning them.
- Use a local caching DNS server (see this thread).
- Check whether the site throttles downloads and limits your download speed (see this thread).
- Log CPU and memory usage during the spider run to see whether there are any problems there.
- Try running the same spider under the scrapyd service.
- See if grequests + lxml performs better (ask if you need any help implementing this solution).
- Try running Scrapy on PyPy; see Running Scrapy on PyPy.
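
As a rough illustration of the settings-related tips, a project's settings.py might contain something like the fragment below. The setting names are the ones mentioned above; the numeric values are assumptions picked for illustration, not measured recommendations, and should be tuned against what the target site tolerates.

```python
# settings.py (sketch): values are illustrative assumptions, not tuned numbers.

# Upper bound on requests Scrapy keeps in flight across all domains.
CONCURRENT_REQUESTS = 32

# Upper bound on simultaneous requests to any single domain. This spider
# only talks to one site, so this is the limit that matters most here.
CONCURRENT_REQUESTS_PER_DOMAIN = 16

# Disable logging entirely. Setting LOG_LEVEL = "INFO" instead is a milder
# option that keeps errors and crawl statistics visible.
LOG_ENABLED = False
```

Raising concurrency only helps while the server keeps answering quickly, so it is worth increasing the values in steps and watching the pages-per-minute figure.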

Hope that helps.
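
The tip about yielding items can be sketched as follows. The helper name iter_sales is hypothetical, only three of the ten fields are shown, and the sketch assumes a Scrapy version where selectors offer .xpath() (the newer name for the .select() call used in the question) and where spiders may yield plain dicts; on older versions, fill a Sale item instead.

```python
def iter_sales(rows):
    """Yield one record per table row as soon as it has been parsed.

    `rows` is assumed to be an iterable of Scrapy selectors, for example
    the result of response.xpath("id('searchresult')/tr").
    """
    for row in rows:
        # Each record is handed to the caller immediately; no list is built.
        yield {
            "Adresse": row.xpath("td[1]/a[1]/text()").extract(),
            "Pris": row.xpath("td[2]/text()").extract(),
            "Salgsdato": row.xpath("td[3]/text()").extract(),
        }


# Inside the spider, parse() then becomes a generator as well:
#
#     def parse(self, response):
#         for sale in iter_sales(response.xpath("id('searchresult')/tr")):
#             yield sale
```

Since most of the run time is likely spent waiting on the network, this change mainly keeps memory use flat and lets item pipelines start working earlier; it is unlikely to change the total crawl time much on its own.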

answered Oct 13 '22 01:10

alecxe

Looking at your code, I'd say most of that time is spent on network requests rather than on processing the responses. All of the tips @alecxe provides in his answer apply, but I'd also suggest the HTTPCACHE_ENABLED setting: it caches the responses so the same request is not made a second time. It won't speed up the first crawl, but it helps on subsequent crawls and even allows offline development. See more info in the docs: http://doc.scrapy.org/en/latest/topics/downloader-middleware.html#module-scrapy.contrib.downloadermiddleware.httpcache
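
A minimal sketch of what enabling the cache might look like in settings.py. HTTPCACHE_ENABLED is the setting named above; the expiration and directory values shown are assumptions chosen for a development setup.

```python
# settings.py (sketch): turn on Scrapy's HTTP cache middleware.
HTTPCACHE_ENABLED = True

# 0 means cached responses never expire; an assumption that suits
# development, where the same pages are re-parsed many times.
HTTPCACHE_EXPIRATION_SECS = 0

# Directory for the cache; a relative path is created inside the
# project's .scrapy data directory.
HTTPCACHE_DIR = "httpcache"
```

With this in place, a second run of the spider reads pages from disk instead of the network, which makes it much quicker to iterate on the XPath expressions.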

answered Oct 13 '22 00:10

Capi Etheriel
