I was wondering what would be the most ethical way to consume a few bytes (386, precisely) of content from a given Site A with an application (e.g. on Google App Engine) running on some Site B. I want to do this right: no scraping intended, I really just need to check the status of a public service, and they currently don't provide any API. The markup on Site A contains a JavaScript array with the info I need, and being able to access it, say, once every five minutes would suffice.
Any advice will be much appreciated.
UPDATE:
First of all, thanks very much for the feedback. Site A is basically the website of the company that currently runs our public subway network. I'm planning to develop a tiny free Android app that gives anyone not only a map of the whole network and its stations but also up-to-date information about the availability of the service (those are the bytes I will eventually be consuming), and so on.
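For concreteness, here is a minimal sketch of the kind of polite, low-volume polling I have in mind. The URL, the array name, the regex, and the User-Agent string are all placeholder assumptions, not the operator's real details; on App Engine this would more likely be a cron-scheduled handler than a loop.

    import json
    import re
    import time
    import urllib.request

    # Placeholder values -- the real URL, variable name, and contact address
    # would come from the actual site and your own details.
    STATUS_URL = "https://subway-operator.example/status"
    POLL_INTERVAL = 5 * 60  # once every five minutes, as described above

    def fetch_status():
        req = urllib.request.Request(
            STATUS_URL,
            # Identify the app and give the operator a way to reach you.
            headers={"User-Agent": "SubwayStatusApp/0.1 (contact: you@example.com)"},
        )
        with urllib.request.urlopen(req, timeout=10) as resp:
            html = resp.read().decode("utf-8", errors="replace")
        # Assumes the page embeds something like: var status = [...];
        # and that the array literal happens to be valid JSON.
        match = re.search(r"var\s+status\s*=\s*(\[.*?\]);", html, re.DOTALL)
        return json.loads(match.group(1)) if match else None

    while True:
        print(fetch_status())
        time.sleep(POLL_INTERVAL)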
There will be some very different points of view, but hopefully here is some food for thought:
I think the most important thing to do is to talk to the site owner first, and determine straight from them whether:
Just my point of view...
Update (4 years later): The question specifically addresses the ethical side of the problem, which is why this old answer is written the way it is.
Typically, in such a situation, you contact them.
If they don't like it, then ethically you can't do it (legally is another story, depending on whether the site provides a license, what login/anonymity or other access restrictions they have, whether you have to use test/fake data, etc.).
If they allow it, they may provide an API (which might involve costs; it will be up to you to determine how much the feature is worth to your app), or promise some sort of expected behavior for you, which might itself be scraping, or whatever other option they decide on.
If they allow it but aren't ready to help make it easier, then scraping (with its other downsides still applicable) will be all right, at least "ethically".
I would not touch it except to email the site admin and get their written permission. That said, if you're consuming the content without extracting value beyond what a single user gets when viewing that data on their site, it's arguable that any TOU they have wouldn't find you in violation. If, however, you get noteworthy value beyond what a single user would get from that data (i.e., say you use the data and your results end up providing value to 100x your own site's users), I'd say you need express permission to do that, to sleep well at night.
All that's off, however, if the info is already in the public domain (and you can prove it), or the data you need from them is under some type of open license, such as a GNU license.
Then again, the web is nothing without links to others' content. We all capture and then re-post stuff on various forums, say -- we read an article on CNN, then comment on it in an online forum, maybe quote the article, and provide a link back to it. It just depends, I guess, on how flexible and open-minded the site's admin and owner are. But really, to avoid being sued (if push comes to shove), I'd get permission.
Obey the site's robots.txt. Don't request anything matched by a Disallow rule in a record that matches your user agent (be liberal in interpreting the name); if there is no record for your user agent, use the record for User-agent: *. Also honor Crawl-delay, which tells you how many seconds you should wait before requesting a resource from that host again.
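If you do end up fetching the page directly, Python's standard library can check those robots.txt rules for you. A small sketch, assuming a placeholder URL and user-agent name:

    import urllib.robotparser

    AGENT = "SubwayStatusApp/0.1"  # placeholder user-agent name

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("https://subway-operator.example/robots.txt")
    rp.read()

    if rp.can_fetch(AGENT, "https://subway-operator.example/status"):
        # crawl_delay() returns the Crawl-delay value (in seconds) from the
        # record matching AGENT, or None if none is specified (Python 3.6+).
        delay = rp.crawl_delay(AGENT) or 5 * 60
        print("Allowed; wait at least %s seconds between requests" % delay)
    else:
        print("Disallowed by robots.txt; ask the operator instead")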