Get past request limit in crawling a web site

2 Answers

OK, first and foremost: if a website doesn't want you to crawl it too often then you shouldn't! It's basic politeness and you should always try to adhere to it.
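One way to find out what a site is willing to tolerate is to read its robots.txt. The sketch below uses Python's standard `urllib.robotparser`; the robots.txt content, the `my-crawler` user agent and the example.com URLs are made up for illustration, and a real crawler would download the file from the site instead of hard-coding it. Not every site publishes a `Crawl-delay`, so treat a missing value as "pick a conservative delay yourself".

```python
import urllib.robotparser

# Hypothetical robots.txt content (assumption); normally fetched from the site.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 5
Disallow: /private/
"""


def load_rules(robots_txt):
    """Parse robots.txt text into a RobotFileParser without any network access."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)

# Seconds the site asks crawlers to wait between requests (None if unspecified).
delay = rules.crawl_delay("my-crawler")

# Whether this user agent may fetch a given URL at all.
allowed = rules.can_fetch("my-crawler", "http://example.com/private/page")
```

A crawler would then sleep for `delay` seconds between requests to that host and skip any URL for which `can_fetch` returns False.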

However, I do understand that some websites, like Google, make their money by crawling your website all day long, yet block you when you try to crawl them.

Solution 1: Proxy Servers

In any case, the alternative to getting a bunch of EC2 machines is to get proxy servers. Proxy servers are MUCH cheaper than EC2. Case in point: http://5socks.net/en_proxy_socks_tarifs.htm

Of course, proxy servers are not as fast as EC2 (bandwidth-wise), but you should be able to strike a balance where you get similar or higher throughput than your 50 EC2 instances for substantially less than you're paying now. That means searching for affordable proxies and testing which ones give you comparable results.

One thing to note: other people may be using the same proxy service to crawl the same website, and they may not be as careful about how they crawl it, so the whole proxy service can get blocked because of some other client's activity (I've personally seen it).
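Here is a minimal sketch of how rotating through a proxy pool could look, including dropping proxies that appear to be blocked (which, as noted, can happen through no fault of your own). Everything named here (`ProxyPool`, `fetch`, `_open_via_proxy`, the status codes treated as "blocked") is a hypothetical choice for illustration, and it assumes plain HTTP proxies; SOCKS proxies would need a different client library than the standard `urllib`.

```python
import urllib.error
import urllib.request
from collections import deque


class ProxyPool:
    """Round-robin over proxy URLs, with the option to drop bad ones."""

    def __init__(self, proxies):
        self._queue = deque(proxies)

    def __len__(self):
        return len(self._queue)

    def next(self):
        """Return the next proxy and move it to the back of the queue."""
        if not self._queue:
            raise RuntimeError("no working proxies left")
        proxy = self._queue[0]
        self._queue.rotate(-1)
        return proxy

    def drop(self, proxy):
        """Remove a proxy that looks blocked; ignore it if already gone."""
        try:
            self._queue.remove(proxy)
        except ValueError:
            pass


def _open_via_proxy(url, proxy):
    """Fetch url through one HTTP proxy; return (status, body)."""
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    opener = urllib.request.build_opener(handler)
    try:
        with opener.open(url, timeout=10) as resp:
            return resp.status, resp.read()
    except urllib.error.HTTPError as err:
        return err.code, b""


def fetch(url, pool, open_via=_open_via_proxy,
          blocked_statuses=(403, 429), max_tries=3):
    """Try successive proxies, dropping any that return a 'blocked' status."""
    for _ in range(max_tries):
        proxy = pool.next()
        status, body = open_via(url, proxy)
        if status in blocked_statuses:
            pool.drop(proxy)  # assumption: these statuses mean the proxy is banned
            continue
        return body
    raise RuntimeError("all attempts were blocked")
```

Usage would be something like `fetch("http://example.com/", ProxyPool(["http://10.0.0.1:8080", "http://10.0.0.2:8080"]))`, with the proxy addresses replaced by the ones you rent. The `open_via` parameter exists only so the network call can be swapped out, for example in tests.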

Solution 2: You-Da-Proxy!

This is a little crazy and I haven't done the math behind this, but you could start a proxy service yourself and sell it to others. You can't use all of your EC2 machines' bandwidth anyway, so the best way for you to cut costs is to do what Amazon does: sub-lease the hardware.

answered Sep 27 '22 19:09

Kiril

Using proxies is, by far, the most common way to tackle this problem. There are other, higher-level solutions that provide a sort of "page downloading as a service", guaranteeing you get "clean" pages (not 404s, etc.). One of these is called Crawlera (provided by my company), but there may be others.

answered Sep 27 '22 20:09

Pablo Hoffman

Tags:

distributed-computing

web-crawler

brandon