I've been tasked with automating the comparison of a client's inventories from several unrelated web storefronts. These storefronts don't offer APIs, so I'm forced to write a crawler in Python which will catalog and compare available products and prices between three websites on a weekly basis.
Should I expect the crawler's IP address to be banned, or could legal complaints be made against the source? It seems pretty innocuous (about 500 HTTP page requests, spaced one second apart, performed once a week), but this is brand new territory for me.
Most commercial web crawlers receive fairly low ethicality violation scores which means most of the crawlers' behaviors are ethical; however, many commercial crawlers still consistently violate or misinterpret certain robots.
If you're doing web crawling for your own purposes, it is legal as it falls under fair use doctrine. The complications start if you want to use scraped data for others, especially commercial purposes. Quoted from Wikipedia.org, 100 F. Supp.
A web crawler, or spider, is a type of bot that is typically operated by search engines like Google and Bing. Their purpose is to index the content of websites all across the Internet so that those websites can appear in search engine results.
A web crawler works by discovering URLs and reviewing and categorizing web pages. Along the way, they find hyperlinks to other webpages and add them to the list of pages to crawl next. Web crawlers are smart and can determine the importance of each web page.
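The discover-and-queue loop described above can be sketched in a few lines. This is only an illustration under stated assumptions: the crawl function, the LinkExtractor class, the shop.example URLs and the in-memory FAKE_SITE dictionary are all hypothetical names, and the fetch argument stands in for whatever HTTP client you actually use.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=500):
    """Breadth-first crawl; `fetch` is a hypothetical callable returning HTML or None."""
    frontier = deque([start_url])   # pages waiting to be crawled
    seen = {start_url}              # every URL ever queued, to avoid repeats
    visited = []                    # pages actually fetched, in order
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:            # page missing or fetch failed: skip it
            continue
        visited.append(url)
        extractor = LinkExtractor()
        extractor.feed(html)
        for href in extractor.links:
            absolute = urljoin(url, href)   # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return visited


# Hypothetical in-memory "site" so the sketch runs without network access.
FAKE_SITE = {
    "https://shop.example/": '<a href="/a">A</a> <a href="/b">B</a>',
    "https://shop.example/a": '<a href="/b">B</a> <a href="/">home</a>',
    "https://shop.example/b": '<a href="/missing">gone</a>',
}

order = crawl("https://shop.example/", FAKE_SITE.get)
```

A real crawler would also restrict the frontier to the target domain and decide which pages matter; this sketch only shows the queueing mechanics.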
Ethical: You should comply with the robots.txt protocol to respect the site owners' wishes. The Python standard library includes the urllib.robotparser module (named robotparser in Python 2) for this purpose.
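As a rough sketch of that check, the standard library's urllib.robotparser (the Python 3 name of the module) can parse robots.txt rules and answer whether a URL may be fetched. The robots.txt content, the bot name and the shop.example URLs below are made up for illustration; against a real site you would call set_url() and read() instead of parsing an inline string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /checkout/
Crawl-delay: 2
"""

USER_AGENT = "inventory-compare-bot"  # assumed name for the crawler


def build_parser(robots_text):
    """Return a RobotFileParser loaded from robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Product pages are not disallowed, the checkout path is.
allowed = parser.can_fetch(USER_AGENT, "https://shop.example/products/42")
blocked = parser.can_fetch(USER_AGENT, "https://shop.example/checkout/cart")

# Crawl-delay is a non-standard but common directive; None if absent.
delay = parser.crawl_delay(USER_AGENT)
```

Checking can_fetch() before every request, and honouring the crawl delay when one is given, keeps the crawler within what the site owner has published.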
Yes, you should expect to be IP-banned for screen-scraping for unauthorised syndication. Moreover, the less scrupulous, more creative site owners will, instead of blocking your robot, either attempt to crash or confuse it by sending it malformed data, or deliberately send it false data.
If your business model is based on unauthorised screen-scraping, it will fail.
Normally, it is in the site owners' interests to allow you to screen-scrape, so you can get permission (they are unlikely to make a stable API for you though unless you pay them lots of money to do so).
If they don't give you permission, you should probably not.
Some tips:
If you do it all in good faith, transparently, you are unlikely to be blocked by a human unless they decide what you're doing is fundamentally against their business model.
If you behave in an underhand, cloak-and-dagger way, you can expect hostility.
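One common way to be transparent is to have the crawler identify itself in its request headers so a site owner can see who is crawling and how to reach you. The sketch below is only an assumption about how you might do that: the bot name, the info URL and the contact address are placeholders, and nothing here sends a request.

```python
from urllib.request import Request

# Placeholder identity; replace with your real bot name and contact details.
BOT_HEADERS = {
    "User-Agent": "inventory-compare-bot/0.1 (+https://client.example/bot-info)",
    "From": "crawler-admin@client.example",
}


def build_request(url):
    """Build (but do not send) a request that openly identifies the crawler."""
    return Request(url, headers=BOT_HEADERS)


request = build_request("https://shop.example/products/42")
```

The resulting object can be passed to urllib.request.urlopen() when you are ready to fetch.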
Also note that some data is proprietary and is considered by its owners to be intellectual property. Some sites, such as currency exchange sites, search engines and stock market trackers, particularly dislike having their data crawled, since their business is basically selling the very data you're crawling.
That being said, in the US, you cannot copyright data itself - just how you format the data. So according to US law it's OK to grab crawled data as long as you don't store it in its original formatting (HTML).
But, in a lot of European countries data itself can be copyrighted. And the web is a global beast. People from Europe can visit your site. Which according to the law in some countries means that you are doing business in those countries. So even if you are protected legally in the US it doesn't mean that you won't get sued elsewhere in the world.
My advice is to go through the site and read its usage policy. If the site explicitly disallows crawling, then you shouldn't do it. And as Jim mentioned, respect robots.txt.
Then again, there is ample legal precedent from courts around the world that makes search engines legal. And search engines are themselves voracious web crawlers. On the other hand it looks like almost every year at least one news agency sues or tries to sue Google for web crawling.
With all the above in mind, be very careful what you do with crawled data. I would say private use is OK as long as you don't overload the servers. I myself do it regularly to get TV programming schedules, etc.
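To avoid overloading a server, the simplest approach is to pause between requests, in line with the one-request-per-second pace described in the question. This is a minimal sketch: fetch_politely is a hypothetical helper, and the fetch and sleep arguments are injected so the example runs without touching the network or actually waiting.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=1.0, sleep=time.sleep):
    """Fetch each URL in turn, pausing `delay_seconds` between requests."""
    results = {}
    for index, url in enumerate(urls):
        if index > 0:               # no need to wait before the first request
            sleep(delay_seconds)
        results[url] = fetch(url)
    return results


# Stand-ins for a real HTTP fetch and a real sleep, for demonstration only.
pauses = []
pages = fetch_politely(
    ["/p/1", "/p/2", "/p/3"],
    fetch=lambda url: f"<html>{url}</html>",
    sleep=pauses.append,            # record the pause instead of waiting
)
```

In real use you would leave sleep at its default and pass a fetch function that performs the HTTP request.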