 

What would be the most ethical way to consume content from a site that is not providing an API? [closed]

I was wondering what would be the most ethical way to consume a small amount of content (386 bytes, precisely) from a given Site A, with an application (e.g. on Google App Engine) on some Site B, but doing it right. No scraping intended; I really just need to check the status of a public service, and they currently don't provide any API. The markup on Site A has a JavaScript array with the info I need, and being able to access that, say, once every five minutes would suffice.
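For illustration, here is roughly what I have in mind for the extraction itself. Everything in this sketch (the URL, the variable name serviceStatus, the regex) is made up, since the real page is obviously different:

    import json
    import re
    import urllib.request

    # Hypothetical: fetch Site A's page and pull out the embedded JS array.
    URL = "http://site-a.example/status"
    req = urllib.request.Request(URL, headers={"User-Agent": "SubwayStatusApp/0.1"})
    with urllib.request.urlopen(req) as resp:
        html = resp.read().decode("utf-8")

    # Look for something like: var serviceStatus = ["Line A: normal", ...];
    match = re.search(r'var\s+serviceStatus\s*=\s*(\[.*?\]);', html, re.DOTALL)
    if match:
        # Works only as long as the array literal happens to be valid JSON.
        status = json.loads(match.group(1))
        print(status)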

Any advice will be much appreciated.

UPDATE:

First of all, thanks very much for the feedback. Site A is basically the website of the company that currently runs our public subway network. I'm planning to develop a tiny free Android app so anyone can have not only a map of the whole network and its stations but also updated information about the availability of the service (those are the bytes I will eventually be consuming), et cetera.

Nano Taboada asked Jun 18 '11


4 Answers

There will be some very different points of view, but hopefully here is some food for thought:

  1. Ask the site owner first; if they know ahead of time, they are less likely to be annoyed.
  2. Is the content on Site A accessible on a public part of the site, i.e. without the need to log in?
  3. If the answer to #2 is that it is public content, then I wouldn't see an issue, as scraping the site for that information is really no different from pointing your browser at the site and reading it for yourself.
  4. Of course, the answer to #3 depends on how the site is monetised. If Site A serves advertising to generate revenue, then it might not be a good idea to start scraping content, as you would be bypassing how the site makes money.

I think the most important thing to do is talk to the site owner first, and determine straight from them:

  1. Whether it is OK for you to scrape content from their site.
  2. Whether they have an API in the pipeline (simply highlighting the desire may prompt them to consider one).

Just my point of view...

Matthew Abbott answered Oct 20 '22


Update (4 years later): The question specifically addresses the ethical side of the problem, which is why this old answer is written the way it is.

Typically, in such a situation, you contact them.

If they don't like it, then ethically you can't do it (legally is another story, depending on whether the site grants a license, what login/anonymity or other restrictions they place on access, whether you have to use test/fake data, etc.).

If they allow it, they may provide an API (which might involve costs; it will be up to you to determine how much the feature is worth to your app), or promise some sort of expected behaviour for you, which might itself be scraping, or whatever other option they decide.

If they allow it but are not ready to help make it easier, then scraping (with its other downsides still applicable) will be alright, at least ethically.

Meligy answered Oct 19 '22


I would not touch it, save for emailing the site admin and getting their written permission. That said, if you're consuming the content but not extracting value beyond what a single user gets when observing the data, it's arguable that any terms of use they have wouldn't find you in violation. If, however, you get noteworthy value beyond what a single user would get, say your results end up providing value to 100x as many users on your own site, I'd say you need express permission to sleep well at night.

All that's off, however, if the info is already in the public domain (and you can prove it), or the data you need from them is under some type of open license, such as one from GNU.

Then again, the web is nothing without links to others' content. We all capture and re-post stuff on various forums: we read an article on CNN, then comment on it in an online forum, maybe quote the article, and provide a link back to it. It just depends, I guess, on how flexible and open-minded the site's admin and owner are. But really, to avoid being sued (if push comes to shove), I'd get permission.

wantTheBest answered Oct 20 '22


  1. Use a user-agent header that identifies your service.
  2. Check their robots.txt (and re-check it at regular intervals, e.g. daily).
  3. Respect any Disallow in a record that matches your user agent (be liberal in interpreting the name). If there is no record for your user agent, use the record for User-agent: *.
  4. Respect the (non-standard) Crawl-delay, which tells you how many seconds you should wait before requesting a resource from that host again. A minimal sketch of all four steps follows this list.
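A minimal sketch in Python, assuming the standard library's urllib.robotparser; the user-agent string and URLs are placeholders, not real services:

    import time
    import urllib.request
    import urllib.robotparser

    USER_AGENT = "SubwayStatusBot/1.0 (+http://site-b.example/about)"  # step 1
    BASE = "http://site-a.example"
    TARGET = BASE + "/status"

    # Step 2: fetch robots.txt (re-run this periodically, e.g. daily).
    robots = urllib.robotparser.RobotFileParser()
    robots.set_url(BASE + "/robots.txt")
    robots.read()

    # Step 3: the parser applies the record matching our user agent,
    # or falls back to the record for User-agent: *.
    if robots.can_fetch(USER_AGENT, TARGET):
        # Step 4: honour Crawl-delay; fall back to 5 minutes if none is given.
        delay = robots.crawl_delay(USER_AGENT) or 300
        req = urllib.request.Request(TARGET, headers={"User-Agent": USER_AGENT})
        with urllib.request.urlopen(req) as resp:
            print(resp.read()[:386])  # the handful of bytes we actually need
        time.sleep(delay)  # wait before the next request to the same host
    else:
        print("robots.txt disallows fetching", TARGET)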
unor answered Oct 20 '22