I am scraping 23770 webpages with a pretty simple web scraper using scrapy. I am quite new to scrapy and even python, but managed to write a spider that does the job. It is, however, really slow (it takes approx. 28 hours to crawl the 23770 pages).
I have looked on the scrapy webpage, the mailing lists and stackoverflow, but I can't seem to find generic recommendations for writing fast crawlers that are understandable for beginners. Maybe my problem is not the spider itself, but the way I run it. All suggestions welcome!
I have listed my code below, if it's needed.
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item, Field
import re


class Sale(Item):
    Adresse = Field()
    Pris = Field()
    Salgsdato = Field()
    SalgsType = Field()
    KvmPris = Field()
    Rum = Field()
    Postnummer = Field()
    Boligtype = Field()
    Kvm = Field()
    Bygget = Field()


class HouseSpider(BaseSpider):
    name = 'House'
    allowed_domains = ["http://boliga.dk/"]
    start_urls = ['http://www.boliga.dk/salg/resultater?so=1&type=Villa&type=Ejerlejlighed&type=R%%C3%%A6kkehus&kom=&amt=&fraPostnr=&tilPostnr=&iPostnr=&gade=&min=&max=&byggetMin=&byggetMax=&minRooms=&maxRooms=&minSize=&maxSize=&minsaledate=1992&maxsaledate=today&kode=&p=%d' %n for n in xrange(1, 23770, 1)]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select("id('searchresult')/tr")
        items = []
        for site in sites:
            item = Sale()
            item['Adresse'] = site.select("td[1]/a[1]/text()").extract()
            item['Pris'] = site.select("td[2]/text()").extract()
            item['Salgsdato'] = site.select("td[3]/text()").extract()
            Temp = site.select("td[4]/text()").extract()
            Temp = Temp[0]
            m = re.search('\r\n\t\t\t\t\t(.+?)\r\n\t\t\t\t', Temp)
            if m:
                found = m.group(1)
                item['SalgsType'] = found
            else:
                item['SalgsType'] = Temp
            item['KvmPris'] = site.select("td[5]/text()").extract()
            item['Rum'] = site.select("td[6]/text()").extract()
            item['Postnummer'] = site.select("td[7]/text()").extract()
            item['Boligtype'] = site.select("td[8]/text()").extract()
            item['Kvm'] = site.select("td[9]/text()").extract()
            item['Bygget'] = site.select("td[10]/text()").extract()
            items.append(item)
        return items
Thanks!
Here's a collection of things to try:
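- try increasing the CONCURRENT_REQUESTS_PER_DOMAIN and CONCURRENT_REQUESTS settings (see the Scrapy settings docs)
- turn off logging with LOG_ENABLED = False (see the Scrapy settings docs)
- try yielding each item in the loop instead of collecting items into the items list and returning them
- try running Scrapy on pypy, see Running Scrapy on PyPy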
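As a rough sketch of the concurrency and logging suggestions, this is how those settings might look in the project's settings.py. The numeric values are assumptions picked for illustration, not tuned recommendations; it is probably safest to raise them gradually and watch how the site responds.

```python
# settings.py (sketch; the numeric values are illustrative assumptions)

# Maximum number of requests Scrapy performs concurrently overall.
CONCURRENT_REQUESTS = 32

# Maximum number of concurrent requests to any single domain.
# The spider crawls only one site, so this limit matters most here.
CONCURRENT_REQUESTS_PER_DOMAIN = 16

# Turn logging off completely to avoid per-request log overhead.
LOG_ENABLED = False
```

Note that LOG_ENABLED = False also hides errors, so it may be worth leaving logging on until the spider is known to work.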
Hope that helps.
Looking at your code, I'd say most of that time is spent in network requests rather than processing the responses. All of the tips @alecxe provides in his answer apply, but I'd also suggest the HTTPCACHE_ENABLED setting, since it caches the responses and avoids requesting the same pages a second time. It would help on subsequent crawls and even with offline development. See more info in the docs: http://doc.scrapy.org/en/latest/topics/downloader-middleware.html#module-scrapy.contrib.downloadermiddleware.httpcache
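A minimal sketch of what enabling the cache might look like in settings.py. Only HTTPCACHE_ENABLED comes from the suggestion above; the expiration and directory values are assumptions shown for illustration.

```python
# settings.py (sketch; only HTTPCACHE_ENABLED is taken from the answer above)

# Store downloaded responses on disk and reuse them on later runs.
HTTPCACHE_ENABLED = True

# Assumption: 0 means cached responses never expire.
HTTPCACHE_EXPIRATION_SECS = 0

# Assumption: directory name for the cache, chosen here for illustration.
HTTPCACHE_DIR = "httpcache"
```

The first crawl still has to download every page; the benefit shows up when the spider is rerun, for example while tweaking the XPath expressions.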