
Ethics of robots.txt [closed]

Tags:

robots.txt



Arguments:

  1. A robots.txt file is an implied license, especially since you are aware of it. Thus, continuing to scrape the site could be seen as unauthorized access (i.e., hacking). Sucks, but arguments like this have been made in other legal cases recently (not directly related to robots.txt, but in relation to other "passive controls").
  2. Grabbing prices violates no copyright law, including the DMCA, since copyright does not cover factual information, only creative expression.
  3. Ethically, you should not grab prices because the vendor should be able to change prices without worrying about being accused of a bait-and-switch by people coming from your site.
  4. Have you taken the high road, explaining the site to them and saying you'd love to include them in your list of vendors? Maybe they will love the idea and actually expose the data in a way that is easy for you to consume and less resource-intensive for them to produce.
  5. There are no laws written directly about robots.txt because netiquette is generally followed. Don't be one of the "bad guys."
  6. Some people filter robots because they use URL links to perform "actions" like adding things to carts, and robots leave them with massive numbers of abandoned shopping carts in their database.
  7. Some people filter robots because they have exclusive prices that they can't advertise openly based on agreements with their vendors. You could be putting them in a bad position by exposing those prices on your site.
  8. In this economy, if a company doesn't want to do everything possible to advertise itself, it's their own fault that you don't include them.

The other use of robots.txt is to help protect web spiders from themselves. It's relatively easy for a web spider to get mired in an infinitely deep forest of links, and a properly constructed robots.txt file will tell the spider that "you don't need to go here".
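As a rough sketch of how a well-behaved spider might consult these rules before fetching a page, Python's standard `urllib.robotparser` can answer "may I fetch this URL?". The robots.txt contents, the `price-bot` user agent, and the `shop.example` URLs below are all made up for illustration; a real site's rules will differ.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; a real crawler would download it from the site
# (e.g. with RobotFileParser.set_url() and read()) instead of hardcoding it.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cart/
Disallow: /calendar/
"""

USER_AGENT = "price-bot"  # hypothetical crawler name


def build_parser(robots_text: str) -> RobotFileParser:
    """Parse robots.txt text into a reusable rule checker."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, url: str) -> bool:
    """Return True only if the rules allow USER_AGENT to fetch this URL."""
    return parser.can_fetch(USER_AGENT, url)


parser = build_parser(ROBOTS_TXT)
for url in (
    "https://shop.example/products/42",
    "https://shop.example/cart/add?item=42",
    "https://shop.example/calendar/2031/01/",
):
    print(url, "->", "fetch" if may_fetch(parser, url) else "skip")
```

Here the `/cart/` rule stands in for the "action" links from point 6, and `/calendar/` for the kind of endless link forest a spider can get stuck in.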


Many people have tried to build businesses off building "price comparison" engines that scraped major sites.

Once you start getting any sort of traffic or revenue to speak of, you will receive a cease and desist. It's happened to dozens, if not hundreds, of projects. I even worked on a small project that received a C&D from Craigslist.

You know how they say "It's easier to ask forgiveness than it is to get permission"? It doesn't hold true with page scraping. Get permission, or you will be hearing from their lawyers.

If you're lucky, it'll be early on, when you've got nothing to lose. If it's late, you may lose your business and all your work overnight, with a single letter.

Getting permission shouldn't be hard. Unless you're doing something sneaky, you're likely going to drive them additional traffic. Hell, once your product takes off, sites may be begging you, or even paying you, to add their data.


"No" means "no".


One reason we allow robots to dig through the web without complaint is that we have a way to stop them if we want to. That protects both sides.

Remember the uproar when Cuil's robots were accused of going over-the-top, apparently acting like a DoS attack in some cases and using up bandwidth allowances of some small sites?

If too many people violate robots.txt we might get something worse.
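One way a crawler can avoid that kind of overload is to pace its requests. The sketch below assumes the site publishes a `Crawl-delay` line, which is a widely seen but non-standard robots.txt extension, and falls back to a default otherwise. The `price-bot` name, the sample rules, and the one-second default are illustrative assumptions, not recommendations from any particular site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt asking crawlers to wait 5 seconds between requests.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 5
Disallow: /cart/
"""

USER_AGENT = "price-bot"  # hypothetical crawler name
DEFAULT_DELAY = 1.0       # assumed fallback when no Crawl-delay is given


def delay_for(robots_text: str) -> float:
    """Return the pause, in seconds, to leave between requests."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    delay = parser.crawl_delay(USER_AGENT)  # None when no Crawl-delay applies
    return float(delay) if delay is not None else DEFAULT_DELAY


def seconds_to_wait(last_request: float, now: float, delay: float) -> float:
    """How much longer to sleep before the next request is polite."""
    return max(0.0, last_request + delay - now)


delay = delay_for(ROBOTS_TXT)
print("delay between requests:", delay)
print("still to wait:", seconds_to_wait(last_request=100.0, now=102.0, delay=delay))
```

In a real crawler, `last_request` and `now` would come from something like `time.monotonic()`, and the result would be passed to `time.sleep()` before each fetch.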


To answer the narrow question: for the price comparison website, you're probably best off grabbing the price in real time, rather than scraping the whole site in advance. Hard to imagine that being a problem.