legal or ethical pitfalls for web crawler? [closed]

Tags:

web-crawler

I've been tasked with automating the comparison of a client's inventories from several unrelated web storefronts. These storefronts don't offer APIs, so I'm forced to write a crawler in python which will catalog and compare available products and prices between three websites on a weekly basis.

Should I expect the crawler's IP address to be banned or could legal complaints be made against the source? It seems pretty innocuous (about 500 http page requests separated by one second per request, performed once a week), but this is brand new territory for me.

568

asked Jan 12 '11 00:01

Fancypants_MD

3 Answers

Ethical: You should follow the robots.txt protocol so that you respect the site owners' wishes. The Python standard library includes the robotparser module for this purpose (it is named urllib.robotparser in Python 3).
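As a minimal sketch of that check, assuming Python 3 (where the module is `urllib.robotparser`): the bot name, the `shop.example` URLs and the sample robots.txt content below are all hypothetical, and the rules are parsed from a string so the example runs without network access.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. A real crawler would instead call
# parser.set_url("https://shop.example/robots.txt") followed by parser.read().
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /checkout/
Crawl-delay: 2
"""

USER_AGENT = "InventoryCompareBot"  # hypothetical bot name


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt rules that are already held in memory."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)

# Ask before every fetch whether the path is allowed for this user agent.
print(parser.can_fetch(USER_AGENT, "https://shop.example/products/123"))
print(parser.can_fetch(USER_AGENT, "https://shop.example/checkout/cart"))

# crawl_delay() returns None when the site does not declare a Crawl-delay.
print(parser.crawl_delay(USER_AGENT))
```

If `can_fetch` returns `False` for a URL, the crawler should skip it; if a crawl delay is declared, it is reasonable to wait at least that long between requests.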

133

answered Oct 09 '22 18:10

Jim

Yes you should (expect to be IP banned for screen-scraping for unauthorised syndication). Moreover, the less scrupulous, more creative site owners will, instead of blocking your robot, either attempt to crash/confuse it by sending it malformed data, or deliberately send it false data.

If your business model is based on unauthorised screen-scraping, it will fail.

Normally, it is in the site owners' interests to allow you to screen-scrape, so you can get permission (they are unlikely to make a stable API for you though unless you pay them lots of money to do so).

If they don't give you permission, you should probably not scrape their site.

Some tips:

- Give admins of authorised syndication sites a mechanism to ask you to stop scraping their site, in case your bot causes them operational problems. This could be an email address, but please monitor it.
- If you cannot contact the site owner to get permission, make sure it is easy for them to contact you should the need arise (put a URL or email address in the robot's UA string).
- Make it clear what the purpose of your screen-scraping is, and what your retention and other policies are.
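One way to put contact details in the UA string, sketched with the standard library's `urllib.request`. The bot name, info page URL and email address are placeholders, not anything from the original question.

```python
import urllib.request

# Hypothetical bot name, info page and contact address: replace with real ones
# that somebody actually monitors.
USER_AGENT = (
    "InventoryCompareBot/1.0 "
    "(+https://example.com/bot-info; bot-admin@example.com)"
)


def build_request(url: str) -> urllib.request.Request:
    """Return a GET request that identifies the crawler and its operator."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


request = build_request("https://shop.example/products/123")

# urllib stores header names capitalized as "User-agent".
print(request.get_header("User-agent"))

# Sending it would look like this (left commented out to avoid network access):
# with urllib.request.urlopen(request, timeout=10) as response:
#     html = response.read()
```

The page behind the URL in the UA string is a natural place to state the purpose of the crawl and the retention policy.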

If you do it all in good faith, transparently, you are unlikely to be blocked by a human unless they decide what you're doing is fundamentally against their business model.

If you behave in an underhand, cloak-and-dagger way, you can expect hostility.

answered Oct 09 '22 18:10

MarkR

Also note that some data is proprietary and is considered by its owners to be intellectual property. Some sites, such as currency exchange sites, search engines and stock market trackers, particularly dislike having their data crawled, since their business is basically selling the very data you're crawling.

That being said, in the US, you cannot copyright data itself - just how you format the data. So according to US law it's OK to grab crawled data as long as you don't store it in its original formatting (HTML).

But in a lot of European countries, data itself can be copyrighted. And the web is a global beast. People from Europe can visit your site, which according to the law in some countries means that you are doing business in those countries. So even if you are protected legally in the US, it doesn't mean that you won't get sued elsewhere in the world.

My advice is to go through the site and read its usage policy. If the site explicitly disallows crawling then you shouldn't do it. And as Jim mentioned, respect robots.txt.

Then again, there is ample legal precedent from courts around the world that makes search engines legal. And search engines are themselves voracious web crawlers. On the other hand it looks like almost every year at least one news agency sues or tries to sue Google for web crawling.

With all the above in mind, be very careful what you do with crawled data. I would say private use is OK as long as you don't overload the servers. I myself do it regularly to get TV programming schedules, etc.
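To avoid overloading a server, a crawler can enforce a minimum gap between requests, such as the one second mentioned in the question. The `Throttle` class below is a hypothetical helper, not part of any library; the clock and sleep functions are injectable only so that it can be tested without real waiting.

```python
import time


class Throttle:
    """Enforce a minimum interval between successive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Sleep if the previous request was too recent; return seconds slept."""
        delay = 0.0
        if self._last is not None:
            elapsed = self._clock() - self._last
            delay = max(0.0, self.min_interval - elapsed)
            if delay:
                self._sleep(delay)
        self._last = self._clock()
        return delay
```

In use, the crawler would create `throttle = Throttle(1.0)` once and call `throttle.wait()` immediately before each page request, so roughly 500 pages would take a little over eight minutes.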

answered Oct 09 '22 18:10

slebetman
