Could a web-scraper get around a good throttle protection?

Question

Suppose a data source enforces a tight IP-based throttle. Would a web scraper have any way to download the data if the throttle starts rejecting its requests after as little as 1% of the data has been downloaded?

The only technique I could think of a hacker using here would be some sort of proxy system. But it seems like the proxies (even if fast) would eventually all hit the throttle.

Update: Some people below have mentioned big proxy networks like Yahoo Pipes and Tor, but couldn't these IP ranges or known exit nodes be blacklisted as well?

rook · Accepted Answer

A list of thousands of proxies can be compiled for free. IPv6 addresses can be rented for pennies. Hell, an attacker could boot up an Amazon EC2 micro instance for 2-7 cents an hour.

And you want to stop people from scraping your site? The internet doesn't work that way, and hopefully it never will.
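A rough sketch of the arithmetic, assuming the throttle cuts each IP off after 1% of the data (the figure from the question): about 100 addresses are enough to cover everything. `PerIpThrottle` and `scrape` are hypothetical names for a toy simulation, not a real throttle or a real scraper.

```python
from collections import defaultdict


class PerIpThrottle:
    """Toy server-side throttle: each IP may fetch at most `limit` items."""

    def __init__(self, limit):
        self.limit = limit
        self.served = defaultdict(int)

    def allow(self, ip):
        # Reject once this address has used up its quota.
        if self.served[ip] >= self.limit:
            return False
        self.served[ip] += 1
        return True


def scrape(total_items, ips, throttle):
    """Rotate through `ips` round-robin; return how many items were fetched."""
    fetched = 0
    live = list(ips)
    while fetched < total_items and live:
        for ip in list(live):
            if fetched >= total_items:
                break
            if throttle.allow(ip):
                fetched += 1
            else:
                live.remove(ip)  # this address is burned; stop using it
    return fetched


if __name__ == "__main__":
    ips = [f"10.0.0.{i}" for i in range(100)]  # made-up proxy addresses
    print(scrape(10_000, ips[:1], PerIpThrottle(limit=100)))
    print(scrape(10_000, ips, PerIpThrottle(limit=100)))
```

With 10,000 items and a limit of 100 items per IP, a single address stalls at 100 items, while 100 addresses fetch all 10,000. The throttle slows a scraper down only in proportion to how hard it is to get more addresses.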

(I have seen IRC servers port-scan clients to see whether ports 8080, 3128, or 1080 are open. However, some proxy servers use different ports, and there are also legitimate reasons to run a proxy server or to have these ports open, such as running Apache Tomcat. You could take it up a notch by using YAPH to check whether a client is running a proxy server. In effect, you'd be using an attacker's tool against them ;)
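The port check described above might look roughly like this. It is a minimal sketch using only a plain TCP connect; `open_ports` and `COMMON_PROXY_PORTS` are names made up for illustration, and an open port only hints at a proxy, it does not prove one.

```python
import socket

# The three ports mentioned above; real proxies may listen elsewhere.
COMMON_PROXY_PORTS = (8080, 3128, 1080)


def open_ports(host, ports=COMMON_PROXY_PORTS, timeout=1.0):
    """Return the subset of `ports` on `host` that accept a TCP connection."""
    found = []
    for port in ports:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                found.append(port)
        except OSError:
            pass  # closed, filtered, or timed out
    return found


if __name__ == "__main__":
    print(open_ports("127.0.0.1"))
```

Because something like Tomcat on 8080 would also show up here, the result is better treated as one signal among several than as grounds for an outright ban.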

Paul Dixon · Answer

Someone using Tor would be hopping IP addresses every few minutes. I used to run a website where this was a problem, and resorted to blocking the IP addresses of known Tor exit nodes whenever excessive scraping was detected. You can implement this if you can find a regularly updated list of Tor exit nodes, for example, https://www.dan.me.uk/tornodes
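A minimal sketch of the lookup, assuming the exit-node list has already been downloaded as plain text with one IP address per line (the exact format of any given list is an assumption here, so check it first). `parse_exit_list` and `is_tor_exit` are hypothetical helper names.

```python
import ipaddress


def parse_exit_list(text):
    """Parse one-IP-per-line text into a set of address objects."""
    exits = set()
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comment lines
        try:
            exits.add(ipaddress.ip_address(line))
        except ValueError:
            continue  # ignore anything that is not a valid IP
    return exits


def is_tor_exit(client_ip, exits):
    """True if `client_ip` appears in the parsed exit-node set."""
    return ipaddress.ip_address(client_ip) in exits


if __name__ == "__main__":
    # Documentation-range addresses standing in for a real list.
    sample = "# sample list\n192.0.2.1\n2001:db8::1\nnot-an-ip\n"
    exits = parse_exit_list(sample)
    print(is_tor_exit("192.0.2.1", exits), is_tor_exit("192.0.2.2", exits))
```

Parsing into `ipaddress` objects rather than comparing strings means differently written forms of the same IPv6 address still match. The list needs to be refreshed regularly, since exit nodes come and go.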

Herberth Amaral · Answer

You could use a P2P crawling network to accomplish this task. There will be a lot of IPs available, and it won't matter if one of them gets throttled. You could also combine many client instances using some proxy configuration, as suggested in previous answers.

I think you can use YaCy, an open-source P2P crawling network.
