Legal or ethical pitfalls for a web crawler? [closed]

Tags:

web-crawler

I've been tasked with automating the comparison of a client's inventories across several unrelated web storefronts. These storefronts don't offer APIs, so I'm forced to write a crawler in Python that will catalog and compare available products and prices across three websites on a weekly basis.

Should I expect the crawler's IP address to be banned, and could legal complaints be made against us? The load seems pretty innocuous (about 500 HTTP page requests, spaced one second apart, performed once a week), but this is brand-new territory for me.

asked Jan 12 '11 by Fancypants_MD

3 Answers

Ethical: You should honor the robots.txt protocol to ensure that you comply with the site owners' wishes. The Python standard library includes the robotparser module (urllib.robotparser in Python 3) for this purpose.
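
A minimal sketch of that check, assuming a placeholder storefront URL and a hypothetical bot name (neither comes from the question):

```python
import urllib.robotparser

# Fetch and parse the site's robots.txt once per crawl run.
rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://storefront.example.com/robots.txt")
rp.read()

# Hypothetical user-agent name; use whatever identifies your crawler.
BOT = "InventoryCompareBot"
if rp.can_fetch(BOT, "https://storefront.example.com/products"):
    print("robots.txt allows this page")
else:
    print("robots.txt disallows this page; skip it")
```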

answered Oct 09 '22 by Jim

Yes, you should expect to be IP-banned for screen-scraping for unauthorised syndication. Moreover, the less scrupulous, more creative site owners will, instead of blocking your robot, either attempt to crash or confuse it by sending it malformed data, or deliberately feed it false data.

If your business model is based on unauthorised screen-scraping, it will fail.

Normally it is in the site owners' interest to allow you to screen-scrape, so you can often get permission (though they are unlikely to build a stable API for you unless you pay them a lot of money to do so).

If they don't give you permission, you should probably not scrape.

Some tips:

  • Give admins of authorised syndication sites a mechanism to ask you to stop scraping their site, in case your bot causes them operational problems. This could be an email address, but please monitor it.
  • If you cannot contact the site owner to get permission, make sure it is easy for them to contact you should the need arise (put a URL or email address in the robot's UA string, as in the sketch after this list).
  • Make it clear what the purpose of your screen-scraping is, and what your retention and other policies are.
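
As a concrete illustration of the UA-string tip above, here is a minimal sketch using the requests library; the bot name, info URL, and contact address are placeholders, not values from the answer:

```python
import requests

# Identify the bot and give site owners an easy way to reach you.
HEADERS = {
    "User-Agent": (
        "InventoryCompareBot/1.0 "
        "(+https://example.com/bot-info; crawler-admin@example.com)"
    )
}

resp = requests.get(
    "https://storefront.example.com/products",
    headers=HEADERS,
    timeout=30,
)
resp.raise_for_status()
```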

If you do it all in good faith, transparently, you are unlikely to be blocked by a human unless they decide what you're doing is fundamentally against their business model.

If you behave in an underhand, cloak-and-dagger way, you can expect hostility.

answered Oct 09 '22 by MarkR

Also note that some data is proprietary and is considered by its owners to be intellectual property. Sites like currency-exchange services, search engines, and stock-market trackers particularly dislike having their data crawled, since their business is essentially selling the very data you're crawling.

That being said, in the US you cannot copyright the data itself, only the way the data is formatted. So, under US law, it's OK to keep crawled data as long as you don't store it in its original formatting (the HTML).

But in many European countries the data itself can be protected (for example, under the EU's sui generis database right). And the web is a global beast: people from Europe can visit your site, which under the law of some countries means you are doing business in those countries. So even if you are legally protected in the US, that doesn't mean you won't get sued elsewhere in the world.

My advice is to go through each site and read its usage policy. If the site explicitly disallows crawling, then you shouldn't do it. And, as Jim mentioned, respect robots.txt.

Then again, there is ample legal precedent from courts around the world holding that search engines are legal, and search engines are themselves voracious web crawlers. On the other hand, it seems that almost every year at least one news agency sues, or tries to sue, Google over web crawling.

With all the above in mind, be very careful what you do with crawled data. I would say private use is OK as long as you don't overload the servers; I do it regularly myself to fetch TV programming schedules and the like.
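
To make the "don't overload the servers" point concrete, here is a hedged sketch of a throttled fetch loop at the one-request-per-second pace described in the question; the URLs are placeholders:

```python
import time
import requests

# Placeholder list standing in for the ~500 catalog pages per site.
PRODUCT_URLS = [
    f"https://storefront.example.com/products?page={n}" for n in range(1, 6)
]

for url in PRODUCT_URLS:
    page = requests.get(url, timeout=30)
    page.raise_for_status()
    # ... parse product names and prices from page.text here ...
    time.sleep(1.0)  # one second between requests keeps the load negligible
```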

answered Oct 09 '22 by slebetman