
Could a web-scraper get around a good throttle protection?

Suppose that a data source sets a tight IP-based throttle. Would a web scraper have any way to download the data if the throttle starts rejecting their requests after as little as 1% of the data has been downloaded?

The only technique I could think of a hacker using here would be some sort of proxy system. But it seems like the proxies (even if fast) would all eventually hit the throttle.

Update: Some people below have mentioned big proxy networks like Yahoo Pipes and Tor, but couldn't these IP ranges or known exit nodes be blacklisted as well?

bgcode asked Feb 01 '11 21:02



3 Answers

A list of thousands of proxies can be compiled for FREE. IPv6 addresses can be rented for pennies. Hell, an attacker could boot up an Amazon EC2 micro instance for 2-7 cents an hour.
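To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes each IP gets cut off for good after a fixed number of items and that every rented instance has its own IP; the function names and the 100,000-item data set are made up for illustration, and only the 1% cutoff and the 2-7 cents an hour come from this thread.

```python
def ips_needed(total_items: int, items_per_ip: int) -> int:
    """Minimum number of distinct IPs needed to fetch everything,
    assuming each IP is throttled after `items_per_ip` items and never recovers."""
    if total_items < 0 or items_per_ip <= 0:
        raise ValueError("total_items must be >= 0 and items_per_ip must be > 0")
    # Integer ceiling division, avoids floating-point rounding.
    return -(-total_items // items_per_ip)


def rental_cost_cents(instances: int, hours: int, cents_per_hour: int) -> int:
    """Cost in cents of renting `instances` machines for `hours` hours each."""
    return instances * hours * cents_per_hour


# Hypothetical data set of 100,000 items, throttle kicks in at 1% (1,000 items).
n = ips_needed(100_000, 1_000)
low = rental_cost_cents(n, hours=1, cents_per_hour=2)
high = rental_cost_cents(n, hours=1, cents_per_hour=7)
print(f"{n} IPs, roughly {low}-{high} cents for one hour each")
```

So a 1% cutoff only means the scraper needs on the order of a hundred addresses, which is well within reach of any of the sources above.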

And you want to stop people from scraping your site? The internet doesn't work that way, and hopefully it never will.

I have seen IRC servers port-scan connecting clients to check whether ports 8080, 3128, and 1080 are open. However, some proxy servers use different ports, and there are legitimate reasons to run a proxy server or to have these ports open, for example if you are running Apache Tomcat. You could bump it up a notch by using YAPH to see if a client is running a proxy server. In effect you'd be using an attacker's tool against them ;)

rook answered Oct 01 '22 23:10



Someone using Tor would be hopping IP addresses every few minutes. I used to run a website where this was a problem, and resorted to blocking the IP addresses of known Tor exit nodes whenever excessive scraping was detected. You can implement this if you can find a regularly updated list of Tor exit nodes, for example, https://www.dan.me.uk/tornodes
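A minimal sketch of one way to wire this up, not the code I actually ran: it assumes the exit-node list is plain text with one address per line, and it treats every known exit node as a single client for throttling, so hopping between exits doesn't reset the counter. `ScrapeGuard`, `load_exit_nodes`, and the limits are hypothetical names and values.

```python
import time
from collections import defaultdict, deque


def load_exit_nodes(text):
    """Parse a one-address-per-line list into a set (the format is an assumption)."""
    nodes = set()
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            nodes.add(line)
    return nodes


class ScrapeGuard:
    """Sliding-window throttle that pools all known Tor exit nodes into one bucket."""

    TOR_BUCKET = "tor-exit-nodes"

    def __init__(self, exit_nodes, max_requests=100, window_seconds=60.0):
        self.exit_nodes = set(exit_nodes)
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # bucket key -> timestamps of accepted requests

    def allow(self, ip, now=None):
        """Return True if the request from `ip` should be served."""
        now = time.monotonic() if now is None else now
        # Every exit node shares one counter; other clients are counted per IP.
        key = self.TOR_BUCKET if ip in self.exit_nodes else ip
        hits = self._hits[key]
        # Forget requests that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True
```

In practice you would refresh the set from the list on a schedule and call `guard.allow(client_ip)` before serving each request. Once the shared bucket fills up, every exit node is rejected until the window clears, while ordinary visitors are still throttled individually.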

Paul Dixon answered Oct 01 '22 23:10



You could use a P2P crawling network to accomplish this task. There will be a lot of IPs available, and it won't be a problem if one of them gets throttled. You can also combine many client instances using some proxy configuration, as suggested in the previous answers.

I think you can use YaCy, an open-source P2P crawling network.

Herberth Amaral answered Oct 01 '22 23:10
