
Stopping scripters from slamming your website

How about implementing something like SO does with the CAPTCHAs?

If you're using the site normally, you'll probably never see one. If you happen to reload the same page too often, post successive comments too quickly, or do something else that triggers an alarm, make them prove they're human. In your case, that would probably be constant reloads of the same page, following every link on a page too quickly, or filling in an order form too fast to be human.

If they fail the check x times in a row (say, 2 or 3), give that IP a timeout or other such measure. Then at the end of the timeout, dump them back to the check again.
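A minimal sketch of that escalation, assuming an in-memory store keyed by IP; the thresholds, window, and timeout length are all made-up numbers, and the CAPTCHA check itself is whatever you already use:

    import time
    from collections import defaultdict

    SUSPICION_THRESHOLD = 20      # hits on the same page within the window (illustrative)
    WINDOW_SECONDS = 60
    MAX_CAPTCHA_FAILURES = 3      # the "x times in a row"
    TIMEOUT_SECONDS = 15 * 60

    hits = defaultdict(list)          # ip -> timestamps of recent hits
    captcha_failures = defaultdict(int)
    blocked_until = {}                # ip -> time the timeout expires

    def handle_request(ip):
        now = time.time()
        if blocked_until.get(ip, 0) > now:
            return "timeout"                     # still in the penalty box
        hits[ip] = [t for t in hits[ip] if now - t < WINDOW_SECONDS]
        hits[ip].append(now)
        if len(hits[ip]) > SUSPICION_THRESHOLD:
            return "show_captcha"                # make them prove they're human
        return "serve_page"

    def handle_captcha_result(ip, passed):
        if passed:
            captcha_failures[ip] = 0
            hits[ip].clear()
            return "serve_page"
        captcha_failures[ip] += 1
        if captcha_failures[ip] >= MAX_CAPTCHA_FAILURES:
            blocked_until[ip] = time.time() + TIMEOUT_SECONDS
            return "timeout"                     # dump them back to the check when it expires
        return "show_captcha"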


Since you have unregistered users accessing the site, you only have IPs to go on. You can issue a session to each browser and track things that way if you wish. And, of course, throw up a human-check if too many sessions are being (re-)created in succession (in case a bot keeps deleting its cookie).

As far as catching too many innocents, you can put up a disclaimer on the human-check page: "This page may also appear if too many anonymous users are viewing our site from the same location. We encourage you to register or login to avoid this." (Adjust the wording appropriately.)

Besides, what are the odds that X people are loading the same page(s) at the same time from one IP? If they're high, maybe you need a different trigger mechanism for your bot alarm.


Edit: Another option: if they fail too many times, and you're confident about the product's demand, block them and make them personally CALL you to remove the block.

Having people call does seem like an asinine measure, but it makes sure there's a human somewhere behind the computer. The key is to only trigger the block under a condition that should almost never happen unless it's a bot (e.g. failing the check multiple times in a row). Then it FORCES human interaction - picking up the phone.

In response to the comment about having them call me: there's obviously a tradeoff here. Are you worried enough about ensuring your users are human to accept a couple of phone calls when your product goes on sale? If I were that concerned about a product getting to human users, I'd have to make this decision, perhaps sacrificing a (small) bit of my time in the process.

Since it seems like you're determined not to let bots get the upper hand and slam your site, I believe the phone may be a good option. I don't make a profit off your product, so I have no interest in receiving these calls; were you to share some of that profit, however, I might become interested. As this is your product, you have to decide how much you care and implement accordingly.


The other ways of releasing the block just aren't as effective: a timeout (they'd get to slam your site again afterwards, rinse and repeat), a long timeout (if it was really a human trying to buy your product, they'd be SOL and punished for failing the check), email (easily done by bots), fax (same), or snail mail (takes too long).

You could, of course, instead have the timeout period increase per IP for each time they get a timeout. Just make sure you're not punishing true humans inadvertently.
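If you wanted that escalating version, a tiny sketch of the per-IP back-off (the base duration, doubling, and cap are all illustrative):

    import time

    BASE_TIMEOUT = 5 * 60         # first offence: 5 minutes (illustrative)
    MAX_TIMEOUT = 24 * 60 * 60    # cap it so nobody is locked out forever

    offence_count = {}            # ip -> number of timeouts already issued
    blocked_until = {}            # ip -> time the current timeout expires

    def issue_timeout(ip):
        count = offence_count.get(ip, 0) + 1
        offence_count[ip] = count
        duration = min(BASE_TIMEOUT * 2 ** (count - 1), MAX_TIMEOUT)
        blocked_until[ip] = time.time() + duration
        return duration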


You need to figure out a way to make the bots buy stuff that is massively overpriced: 12mm wingnut: $20. See how many bots snap them up before the script-writers decide you're gaming them.

Use the profits to buy more servers and pay for bandwidth.


My solution would be to make screen-scraping worthless by putting in a roughly 10-minute delay for bots and scripts.

Here's how I'd do it:

  • Log and identify any repeat hitters.

You don't need to log every IP address on every hit. Only track one out of every 20 hits or so. A repeat offender will still show up in randomized, occasional tracking.

  • Keep a cache of your page from about 10-minutes earlier.

  • When a repeat-hitter/bot hits your site, give them the 10-minute old cached page.

They won't immediately know they're getting an old site. They'll be able to scrape it and everything, but they won't win any races anymore, because "real people" will have a 10-minute head start.
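Here's a rough sketch of that mechanism, assuming you can render (or fetch) the live page on demand; the sampling rate, the repeat threshold, and the helper name render_live_page are all placeholders:

    import random
    import time
    from collections import defaultdict

    SAMPLE_RATE = 20           # log roughly 1 in 20 hits
    REPEAT_THRESHOLD = 5       # sampled hits before an IP counts as a repeat hitter
    CACHE_AGE = 10 * 60        # refresh the stale copy roughly every 10 minutes

    sampled_hits = defaultdict(int)              # ip -> number of sampled hits
    stale_cache = {"rendered_at": 0.0, "html": ""}

    def refresh_stale_cache(render_live_page):
        """Call periodically (e.g. from a cron job) so the old copy trails the live page."""
        if time.time() - stale_cache["rendered_at"] >= CACHE_AGE:
            stale_cache["html"] = render_live_page()
            stale_cache["rendered_at"] = time.time()

    def serve_homepage(ip, render_live_page):
        if random.randrange(SAMPLE_RATE) == 0:   # randomized occasional tracking
            sampled_hits[ip] += 1
        if sampled_hits[ip] >= REPEAT_THRESHOLD and stale_cache["html"]:
            return stale_cache["html"]           # repeat hitters get the ~10-minute-old page
        return render_live_page()                # everyone else gets the live page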

Benefits:

  • No hassle or problems for users (like CAPTCHAs).
  • Implemented fully server-side (no reliance on JavaScript/Flash).
  • Serving up an older, cached page should be less performance intensive than a live page. You may actually decrease the load on your servers this way!

Drawbacks:

  • Requires tracking some IP addresses
  • Requires keeping and maintaining a cache of older pages.

What do you think?


Take a look at this article by Ned Batchelder. His article is about stopping spambots, but the same techniques could easily apply to your site.

Rather than stopping bots by having people identify themselves, we can stop the bots by making it difficult for them to make a successful post, or by having them inadvertently identify themselves as bots. This removes the burden from people, and leaves the comment form free of visible anti-spam measures.

This technique is how I prevent spambots on this site. It works. The method described here doesn't look at the content at all.

Some other ideas:

  • Create an official auto-notify mechanism (RSS feed? Twitter?) that people can subscribe to when your product goes on sale. This reduces the need for people to make scripts.
  • Change your obfuscation technique right before a new item goes on sale. So even if the scripters can escalate the arms race, they are always a day behind.

EDIT: To be totally clear, Ned's article above describes methods to prevent the automated PURCHASE of items by preventing a BOT from going through the forms to submit an order. His techniques wouldn't be useful for preventing bots from screen-scraping the home page to determine when a Bandoleer of Carrots comes up for sale. I'm not sure preventing THAT is really possible.

With regard to your comments about the effectiveness of Ned's strategies: Yes, he discusses honeypots, but I don't think that's his strongest strategy. His discussion of the SPINNER is the original reason I mentioned his article. Sorry I didn't make that clearer in my original post:

The spinner is a hidden field used for a few things: it hashes together a number of values that prevent tampering and replays, and is used to obscure field names. The spinner is an MD5 hash of:

  • The timestamp,
  • The client's IP address,
  • The entry id of the blog entry being commented on, and
  • A secret.

Here is how you could implement that at WOOT.com:

Change the "secret" value that is used as part of the hash each time a new item goes on sale. This means that if someone is going to design a BOT to auto-purchase items, it would only work until the next item comes on sale!!

Even if someone is able to quickly re-build their bot, all the other actual users will have already purchased a BOC, and your problem is solved!
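A minimal sketch of such a spinner, using the same ingredients Ned lists; the secret string and the max-age value are placeholders, and the secret would be rotated every time a new item goes on sale:

    import hashlib
    import time

    SECRET = "rotate-me-when-the-next-item-goes-on-sale"   # change this per sale

    def make_spinner(client_ip, entry_id, timestamp=None):
        """Hash together the values so the form can detect tampering and replays."""
        timestamp = int(timestamp if timestamp is not None else time.time())
        raw = f"{timestamp}:{client_ip}:{entry_id}:{SECRET}"
        return hashlib.md5(raw.encode()).hexdigest(), timestamp

    def check_spinner(spinner, timestamp, client_ip, entry_id, max_age=600):
        """Recompute the hash on submit; reject it if it doesn't match or is too old."""
        expected, _ = make_spinner(client_ip, entry_id, timestamp)
        return spinner == expected and time.time() - int(timestamp) <= max_age

    # Embed the spinner and timestamp as hidden fields when rendering the order form,
    # then call check_spinner() with the submitted values before accepting the order.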

The other strategy he discusses is to change the honeypot technique from time to time (again, change it when a new item goes on sale):

  • Use CSS classes (randomized of course) to set the fields or a containing element to display:none.
  • Color the fields the same (or very similar to) the background of the page.
  • Use positioning to move a field off of the visible area of the page.
  • Make an element too small to show the contained honeypot field.
  • Leave the fields visible, but use positioning to cover them with an obscuring element.
  • Use Javascript to effect any of these changes, requiring a bot to have a full Javascript engine.
  • Leave the honeypots displayed like the other fields, but tell people not to enter anything into them.
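Whichever hiding trick you rotate in, the server-side half is the same: a real person never fills in the hidden field, so any submission that does is a bot. A tiny sketch of that check (the field names are invented):

    # The honeypot fields are rendered in the form but hidden from humans by one of
    # the tricks above; any value in them marks the submitter as a bot.
    HONEYPOT_FIELDS = {"email_confirm", "website_url"}   # invented names; rotate per sale

    def looks_like_bot(form_data):
        return any(form_data.get(field) for field in HONEYPOT_FIELDS)

    def handle_order(form_data):
        if looks_like_bot(form_data):
            return "discard"          # silently drop it; don't tell the bot it was caught
        return "process_order"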

I guess my overall idea is to CHANGE THE FORM DESIGN when each new item goes on sale. Or at LEAST, change it when a new BOC goes on sale.

Which is what, a couple times/month?

If you accept this answer, will you give me a heads-up on when the next one is due? :)


Q: How would you stop scripters from slamming your site hundreds of times a second?
A: You don't. There is no way to prevent this behavior by external agents.

You could employ a vast array of technology to analyze incoming requests and heuristically attempt to determine who is and isn't human...but it would fail. Eventually, if not immediately.

The only viable long-term solution is to change the game so that the site is not bot-friendly, or is less attractive to scripters.

How do you do that? Well, that's a different question! ;-)

...

OK, some options have been given (and rejected) above. I am not intimately familiar with your site, having looked at it only once, but since people can read text in images and bots cannot easily do so, change the announcement to be an image. Not a CAPTCHA, just an image (a rough sketch follows the list below):

  • generate the image (cached of course) when the page is requested
  • keep the image source name the same, so that doesn't give the game away
  • most of the time the image will have ordinary text in it, and be aligned to appear to be part of the inline HTML page
  • when the game is 'on', the image changes to the announcement text
  • the announcement text reveals a URL and/or code that must be manually entered to acquire the prize. CAPTCHA the code if you like, but that's probably not necessary.
  • for additional security, the code can be a one-time token generated specifically for the request/IP/agent, so that repeated requests generate different codes. Or you can pre-generate a bunch of random codes (a one-time pad) if on-demand generation is too taxing.
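Here's a rough sketch of generating that image on request with Pillow; the dimensions, wording, and one-time code scheme are all illustrative:

    import io
    import secrets

    from PIL import Image, ImageDraw   # Pillow

    issued_codes = set()   # one-time codes handed out; check and discard on redemption

    def render_announcement(sale_is_on):
        """Render either ordinary text or the 'game on' announcement as PNG bytes."""
        if sale_is_on:
            code = secrets.token_hex(4)          # one-time token for this request
            issued_codes.add(code)
            text = f"It's on! Enter code {code} on the order page."   # illustrative wording
        else:
            text = "Nothing special is happening right now."
        img = Image.new("RGB", (600, 60), "white")
        ImageDraw.Draw(img).text((10, 20), text, fill="black")   # default bitmap font
        buf = io.BytesIO()
        img.save(buf, format="PNG")
        return buf.getvalue()   # serve these bytes under the same image URL every time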

Run time-trials of real people responding to this, and reject any response faster than (say) half of that time with an 'oops, an error occurred, sorry! please try again' message. This event should also trigger an alert to the developers that at least one bot has figured out the code/game, so it's time to change the code/game.

Continue to change the game periodically anyway, even if no bots trigger it, just to waste the scripters' time. Eventually the scripters should tire of the game and go elsewhere...we hope ;-)

One final suggestion: when a request for your main page comes in, put it in a queue and respond to the requests in order in a separate process (you may have to hack/extend the web server to do this, but it will likely be worthwhile). If another request from the same IP/agent comes in while the first request is in the queue, ignore it. This should automatically shed the load from the bots.
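A hedged sketch of that queue, assuming one worker drains it in arrival order and a second request from an IP that is already waiting is simply dropped:

    import queue
    import threading

    pending_ips = set()                 # IPs that already have a request waiting
    pending_lock = threading.Lock()
    request_queue = queue.Queue()

    def enqueue_request(ip, request):
        """Called by the front end; returns False if this IP is already queued."""
        with pending_lock:
            if ip in pending_ips:
                return False            # ignore the duplicate hit
            pending_ips.add(ip)
        request_queue.put((ip, request))
        return True

    def worker(handle_request):
        """Single worker that answers queued requests strictly in order."""
        while True:
            ip, request = request_queue.get()
            try:
                handle_request(request)
            finally:
                with pending_lock:
                    pending_ips.discard(ip)
                request_queue.task_done()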

EDIT: Another option, aside from the use of images, is to use JavaScript to fill in the buy/no-buy text; bots rarely interpret JavaScript, so they wouldn't see it.