
Speed up web scraper

I am scraping 23770 webpages with a pretty simple web scraper using Scrapy. I am quite new to Scrapy and even Python, but managed to write a spider that does the job. It is, however, really slow (it takes approx. 28 hours to crawl the 23770 pages).

I have looked at the Scrapy website, the mailing lists, and Stack Overflow, but I can't seem to find generic recommendations for writing fast crawlers that are understandable for beginners. Maybe my problem is not the spider itself, but the way I run it. All suggestions welcome!

I have listed my code below, if it's needed.

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item, Field
import re

class Sale(Item):
    Adresse = Field()
    Pris = Field()
    Salgsdato = Field()
    SalgsType = Field()
    KvmPris = Field()
    Rum = Field()
    Postnummer = Field()
    Boligtype = Field()
    Kvm = Field()
    Bygget = Field()

class HouseSpider(BaseSpider):
    name = 'House'
    allowed_domains = ["boliga.dk"]  # domain names only, no scheme or path
    start_urls = ['http://www.boliga.dk/salg/resultater?so=1&type=Villa&type=Ejerlejlighed&type=R%%C3%%A6kkehus&kom=&amt=&fraPostnr=&tilPostnr=&iPostnr=&gade=&min=&max=&byggetMin=&byggetMax=&minRooms=&maxRooms=&minSize=&maxSize=&minsaledate=1992&maxsaledate=today&kode=&p=%d' %n for n in xrange(1, 23770, 1)]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select("id('searchresult')/tr")
        items = []      
        for site in sites:
            item = Sale()
            item['Adresse'] = site.select("td[1]/a[1]/text()").extract()
            item['Pris'] = site.select("td[2]/text()").extract()
            item['Salgsdato'] = site.select("td[3]/text()").extract()
            Temp = site.select("td[4]/text()").extract()
            Temp = Temp[0]
            m = re.search('\r\n\t\t\t\t\t(.+?)\r\n\t\t\t\t', Temp)
            if m:
                found = m.group(1)
                item['SalgsType'] = found
            else:
                item['SalgsType'] = Temp
            item['KvmPris'] = site.select("td[5]/text()").extract()
            item['Rum'] = site.select("td[6]/text()").extract()
            item['Postnummer'] = site.select("td[7]/text()").extract()
            item['Boligtype'] = site.select("td[8]/text()").extract()
            item['Kvm'] = site.select("td[9]/text()").extract()
            item['Bygget'] = site.select("td[10]/text()").extract()
            items.append(item)
        return items

Thanks!

asked Jun 10 '13 17:06 by Mace



2 Answers

Here's a collection of things to try:

  • use the latest Scrapy version (if you aren't already)
  • check whether any non-standard middlewares are enabled
  • try increasing the CONCURRENT_REQUESTS_PER_DOMAIN and CONCURRENT_REQUESTS settings (docs)
  • turn off logging with LOG_ENABLED = False (docs)
  • try yielding each item inside the loop instead of collecting items into the items list and returning it
  • use a local DNS cache (see this thread)
  • check whether the site throttles downloads and limits your download speed (see this thread)
  • log CPU and memory usage during the spider run to see if there are any problems there
  • try running the same spider under the scrapyd service
  • see if grequests + lxml performs better (ask if you need any help implementing this solution)
  • try running Scrapy on PyPy; see Running Scrapy on PyPy
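
As a rough sketch of the concurrency and logging tips in the list, the project's settings.py could look something like this. The setting names are real Scrapy settings, but the numbers are illustrative assumptions, not values tuned for boliga.dk; raise them gradually and watch how the site responds.

```python
# settings.py (sketch; the values below are illustrative assumptions)

# Upper bound on requests Scrapy keeps in flight overall (16 by default).
CONCURRENT_REQUESTS = 32

# Upper bound on simultaneous requests to a single domain (8 by default).
# The spider in the question crawls one domain, so this cap matters most.
CONCURRENT_REQUESTS_PER_DOMAIN = 16

# No fixed pause between requests to the same site (0 is the default).
DOWNLOAD_DELAY = 0

# Switch logging off completely; LOG_LEVEL = 'INFO' is a gentler option
# that still reports crawl statistics.
LOG_ENABLED = False
```

For context, 28 hours over 23770 pages works out to roughly 4.2 seconds per page, so any setting that lets more requests overlap is worth trying first.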

Hope that helps.
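
To illustrate the "yield instead of return a list" tip, here is a minimal sketch in plain Python, without Scrapy, so it runs on its own. The function names and the sample rows are hypothetical stand-ins for the parse method and the table rows in the question; in a real spider the same change means replacing items.append(item) with yield item and dropping the final return.

```python
def parse_rows_as_list(rows):
    """Style used in the question: build the whole list, then return it."""
    items = []
    for row in rows:
        items.append({'Adresse': row[0], 'Pris': row[1]})
    return items


def parse_rows_lazily(rows):
    """Generator style: hand over each item as soon as it is built."""
    for row in rows:
        yield {'Adresse': row[0], 'Pris': row[1]}


# Hypothetical sample data standing in for the rows of one result page.
rows = [('Example Street 1', '1.000.000'), ('Example Street 2', '2.500.000')]

# Both styles produce the same items; only the delivery differs.
assert list(parse_rows_lazily(rows)) == parse_rows_as_list(rows)
```

With only a handful of rows per page this probably affects memory use more than total run time, so treat it as a cheap experiment, not a guaranteed speedup.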

answered Oct 13 '22 01:10 by alecxe


Looking at your code, I'd say most of that time is spent on network requests rather than on processing the responses. All of the tips @alecxe provides in his answer apply, but I'd also suggest the HTTPCACHE_ENABLED setting, which caches responses so the same request does not have to be made a second time. It won't speed up the first crawl, but it helps on subsequent crawls and even lets you develop offline. See the docs for more info: http://doc.scrapy.org/en/latest/topics/downloader-middleware.html#module-scrapy.contrib.downloadermiddleware.httpcache
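
A minimal sketch of what enabling the cache might look like in settings.py. The setting names come from Scrapy's HTTP cache middleware; the directory name and the expiry value are assumptions you would adjust for your own project.

```python
# settings.py (sketch; directory name and expiry are assumptions)

# Store every downloaded response so later runs can reuse it.
HTTPCACHE_ENABLED = True

# Where cached responses are kept; a relative path is resolved
# inside the project's data directory.
HTTPCACHE_DIR = 'httpcache'

# 0 means cached responses never expire. Use a positive number of
# seconds if the sale listings should be re-fetched periodically.
HTTPCACHE_EXPIRATION_SECS = 0
```

After one full crawl with these settings, re-running the spider while you tweak the XPath expressions should read pages from disk instead of hitting the site again.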

answered Oct 13 '22 00:10 by Capi Etheriel